How to Hash an Entire Redshift Table?

I want to hash entire Redshift tables in order to check for consistency after upgrades, backups, and other modifications which shouldn't affect table data.
I've found Hashing Tables to Ensure Consistency in Postgres, Redshift and MySQL, but the solution still requires spelling out each column name and type, so it can't be applied to new tables in a generic manner. I'd have to manually change column names and types.
Is there some other function or method by which I could hash / checksum entire tables in order to confirm they are identical? Ideally without spelling out the specific columns and column types of that table.

There is certainly no in-built capability in Redshift to hash whole tables.
Also, I'd be a little careful of the method suggested in that article because, from what I can see, it calculates a hash of all the values in a column but doesn't associate the hashed value with a row identifier. Therefore, if Row 1 and Row 2 swapped values in a column, the hash wouldn't change. So, it's not strictly calculating an adequate hash (but I could be wrong!).
You could investigate using the new Stored Procedures in Redshift to see whether you can create a generic function that would work for any table.
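To make the per-row idea concrete, here is a hedged sketch; the table and column names (my_table, id, col_a, col_b) are placeholders, and you would still need to list the real columns or generate the statement from the system catalog (e.g. pg_table_def):

-- Hedged sketch (placeholder names, NULL handling omitted): hash each row
-- together with its identifier, then fold the row hashes into a single
-- table-level checksum that can be compared across clusters.
SELECT SUM(STRTOL(SUBSTRING(MD5(CAST(id AS VARCHAR) || '|' ||
                                CAST(col_a AS VARCHAR) || '|' ||
                                CAST(col_b AS VARCHAR)), 1, 8), 16)) AS table_checksum
FROM my_table;

Because SUM is order-independent but each row hash includes the row identifier, swapping values between rows changes the checksum; columns that can be NULL would need wrapping in NVL before concatenation.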

Related

Redshift DISTSTYLE KEY. Deciding what's the best column to define as KEY

Well, I recently got into this area of Redshift, trying to optimize disk usage and performance of my database, and having read lots of information on AWS about the topic, I still have some doubts.
First of all, on to my database structure. Per schema, I have 3 master tables with 3 different IDs; these are now DISTSTYLE ALL tables, being small in size.
Each master table has a different number of IDs:
the date table --> largest one (#1 most joined)
the store table --> medium one (#3 most joined)
the item table --> smallest one (#2 most joined)
Then I have my core table, which holds the needed combinations of these IDs to display additional information about them. Anyway, this table should be DISTSTYLE KEY, based on my knowledge. Well, which of the 3 IDs should I select to be my DIST KEY?
What are the criteria for this decision? I understand that for joins I need to look at the sort key; that has been understood and defined as ID_date, because it's the most joined table. So now, what about the distribution per node of this table?
I'm sorry if I'm rambling; I don't want to leave any information out. If I have, feel free to ask! Thanks for taking the time to read!
You'll find the best advice in Amazon Redshift best practices for designing tables. It goes into quite a bit of detail.
However, my rule of thumb is:
The DISTKEY should be the column most used in JOINs between tables
The SORTKEY should be the column most used in WHERE statements
Use DISTSTYLE ALL for small lookup tables
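As a hedged illustration of those rules for your layout (all table and column names below are made up), the core table would distribute and sort on the date ID, since the date table is the most frequently joined, while the small master tables stay DISTSTYLE ALL:

-- Core (fact) table: distribute and sort on the most-joined ID.
CREATE TABLE core_table (
    id_date  INT NOT NULL,
    id_store INT NOT NULL,
    id_item  INT NOT NULL,
    amount   DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (id_date)
SORTKEY (id_date);

-- Small master/lookup table: replicate a full copy to every node.
CREATE TABLE store_table (
    id_store   INT NOT NULL,
    store_name VARCHAR(100)
)
DISTSTYLE ALL;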

How to remove fields from big query table?

I updated the schema of one production BigQuery table and added a RECORD-type field with a few repeated fields inside another RECORD-type field. Now I have to drop this column. I can't delete the whole table, as it already has more than 30 TB of data. Is there any way to drop a particular column in a BigQuery table?
See:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#deleting_a_column_from_a_table_schema
There are two ways to manually delete a column:
Using a SQL query — Choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table — Choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
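A hedged sketch of the first option (dataset, table, and column names are placeholders); note that this rewrites the table in place, so you are billed for scanning it:

-- Recreate the table without the unwanted column.
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT * EXCEPT (column_to_drop)
FROM mydataset.mytable;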

Compound Sort Key vs. Sort Key

Let me ask another question about the Redshift sort key.
We're planning to set the sortkey with the columns frequently used in WHERE clauses.
So far, the best combination for our system seems to be:
DISTSTYLE EVEN + COMPOUND SORTKEY + COMPRESSED Column (except for First SortKey column)
Just wondering which would be better, a simple SORTKEY or a COMPOUND SORTKEY, for our BI tables, which can have diversified queries according to users' analysis.
For example, we set the compound sortkey according to frequency in several queries' WHERE clauses as follows.
COMPOUND SORTKEY
(
PURCHASE_DATE, <-- set as first sort key since it's a date column.
STORE_ID,
CUSTOMER_ID,
PRODUCT_ID
)
But sometimes actual queries filter only on PRODUCT_ID, without the other listed sort keys, or in an order different from the COMPOUND KEY order.
In that case, may I ask whether the COMPOUND SORTKEY can be useless, or whether a simple SORT KEY could be more effective?
I'd be so grateful if you would tell me about your idea and experiences.
The simple rules for Amazon Redshift are:
Use DISTKEY on the column that is most frequently used with JOIN
Use SORTKEY on the column(s) that is most frequently used with WHERE
You are correct that the above compound sort key would only be used if PURCHASE_DATE is included in the WHERE.
An alternative is to use Interleaved Sort Keys, which give equal weighting to many columns and can be used where different fields are often used in the WHERE. However, Interleaved Sort Keys are much slower to VACUUM and are rarely worth using.
So, aim your SORTKEY at the columns used by most of your queries, but don't worry too much about the remaining queries unless you are having some particular performance problems.
See: Redshift Sort Keys - Choosing Best Sort Style | Hevo Blog
Your compound sort key looks sensible to me. It's important to understand that Redshift sort keys are not an index which is used or not used. The sort key is used to physically arrange the data on disk.
The query optimizer "uses" the sort key by looking at the "zone map" (min and max values) for each block during query execution. This happens for all columns regardless of whether they are in the sort key.
Secondary columns in a compound sort key can still be very effective at reducing the data that has to be scanned from disk, especially when the column values are low cardinality.
See this previous example for a query to check on sort key effectiveness: Is my sort key being used?
Please review our guide for designing tables effectively: "Amazon Redshift Engineering’s Advanced Table Design Playbook". The guide discusses the correct use of Interleaved sort keys but note that they should only be used in very specific circumstances.
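To make the trade-off concrete, here is a hedged DDL sketch (table names and column definitions are assumptions that mirror the example above):

-- Compound sort key: prunes blocks best when PURCHASE_DATE (the leading
-- column) appears in the WHERE clause; later columns help progressively less.
CREATE TABLE sales_compound (
    purchase_date DATE,
    store_id      INT,
    customer_id   BIGINT,
    product_id    INT,
    amount        DECIMAL(18,2)
)
COMPOUND SORTKEY (purchase_date, store_id, customer_id, product_id);

-- Interleaved sort key: weights all four columns equally, so a filter on
-- PRODUCT_ID alone can still prune blocks, but VACUUM REINDEX is much slower,
-- which is why it is rarely worth using (see the answers above).
CREATE TABLE sales_interleaved (
    purchase_date DATE,
    store_id      INT,
    customer_id   BIGINT,
    product_id    INT,
    amount        DECIMAL(18,2)
)
INTERLEAVED SORTKEY (purchase_date, store_id, customer_id, product_id);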

Redshift sequential scan on columns defined as sort keys

Referring to the attached image of the query plan for the query I'm executing: created_time is an interleaved sort key which is used as a range filter on the data.
Though it looks like a seq scan is happening on the table data, the rows-scanned column seems empty in the image. Does that mean there is no scan happening, and the sort keys work?
Even though created_time is your sort key, in this query it is not recognized because you are converting it to a date. Therefore, it is scanning the entire table.
You need to leave it unmodified for the planner to know that it is the sort key.
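A hedged illustration of that point (table and column names are assumptions): keep any conversion on the constant side of the predicate so the sort key column itself stays untouched.

-- Likely scans the whole table: the sort key column is wrapped in a cast,
-- so the zone maps on created_time cannot be used for block pruning.
SELECT COUNT(*) FROM events
WHERE created_time::DATE BETWEEN '2019-01-01' AND '2019-01-31';

-- Better: leave created_time unmodified and push the conversion onto the
-- literals, so the range filter can prune blocks via the sort key.
SELECT COUNT(*) FROM events
WHERE created_time >= '2019-01-01'::TIMESTAMP
  AND created_time <  '2019-02-01'::TIMESTAMP;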

Redshift performance: encoding on join column

Would encoding on a join column hurt query performance? I let the COPY command decide the encoding type.
In general, no; encoding on your DIST KEY will even have a positive impact due to the reduction in disk I/O.
According to the AWS table design playbook, there are a few edge cases where an encoding on your DIST KEY will indeed hurt your query performance:
Your query patterns apply range-restricted scans to a column that is very well compressed.
The well-compressed column's blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don't compress well. These columns are > 10x the size of the well-compressed column.
If you want to find the optimal encoding for your table, you can use the Redshift column encoding utility.
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded, which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So, to answer your question: encoding will not hurt your query performance, but running without column encoding is not a best practice.
There are a couple of details on this from AWS respondents:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
The DISTKEY can be compressed, but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound), the other sort key columns can be compressed.
Also, it is generally recommended to use a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables, try using the same dist and sort keys on both tables so Redshift can use a faster merge join.
Based on this, I think that as long as both sides of the join have the same compression, Redshift will join on the compressed values safely.
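Putting that guidance together, here is a hedged DDL sketch (the table, column names, and encodings are assumptions, not taken from the question):

CREATE TABLE orders (
    order_date  DATE          ENCODE raw,   -- first sort key column: leave uncompressed
    customer_id BIGINT        ENCODE az64,  -- DISTKEY / join column: compression is fine
    store_id    INT           ENCODE az64,  -- secondary sort key column: can be compressed
    amount      DECIMAL(18,2) ENCODE az64
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, store_id);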