Redshift sequential scan on columns defined as sort keys - amazon-web-services

Referring to the attached image of the query plan for the query I'm executing: created_time is an interleaved sort key and is used as a range filter on the data.
Although it looks like a sequential scan is happening on the table, the rows-scanned column in the image appears empty. Does that mean no scan is actually happening and the sort key is working?

Even though created_time is your sort key, it is not used in this query because you are applying a function to it (converting it to a date). Redshift therefore scans the entire table.
You need to leave the sort key column unmodified in the predicate so the planner can recognize it as the sort key.
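To illustrate, here is a hedged sketch (the `events` table and its columns are assumptions, not from the question) contrasting a predicate that defeats the sort key with one that uses it:

```sql
-- Applying a function to the sort key column prevents range-restricted scans:
SELECT COUNT(*)
FROM events
WHERE DATE(created_time) = '2019-06-01';   -- full table scan

-- Compare the raw column against a range instead, so the zone maps apply:
SELECT COUNT(*)
FROM events
WHERE created_time >= '2019-06-01'
  AND created_time <  '2019-06-02';        -- range-restricted scan
```

The second form expresses the same day filter as a half-open range on the untouched column, which is what lets Redshift skip blocks.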

Related

I would like to know about BigQuery output destination tables

I'm new to Google Cloud Platform and have a question about BigQuery scheduled queries that write the output of a SELECT statement into a separate table.
If the final SELECT uses ORDER BY, will the destination table preserve that order in its output?
I would like the scheduled query's destination table to hold the results sorted by ORDER BY id. Is that possible?
Sorry for the rudimentary question, and thank you in advance.
BigQuery tables don't guarantee a certain order of its records. Unless you're using clustering - this will physically sort the data in the background. But this will also not guarantee a sorted output of a query. You have to use ORDER BY in order to have a sorted output guaranteed. That means you need a key by which you can sort.
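In other words, sort when you read, not when you write. A hypothetical example (dataset, table, and column names are assumptions):

```sql
-- Writing: the scheduled query can omit ORDER BY; row order in the
-- destination table is not guaranteed either way.
SELECT id, name, amount
FROM my_dataset.source_table;

-- Reading: apply ORDER BY in the query that consumes the table to get
-- a guaranteed order.
SELECT id, name, amount
FROM my_dataset.destination_table
ORDER BY id;
```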

need to order query results when inserting into table with sort key?

I'm responsible for a process that truncates a Redshift table and then repopulates it via a select query from another table.
The target table (that is truncated and reloaded) has a sort key.
My understanding was that I needed to use an "order by" on the select so that the data goes into the (empty) target table in sorted order, but I'm seeing some behavior that suggests that might not be the case.
To test, I took an existing table that was ~70% unsorted (as reported by svv_table_info). I created a new table with the exact same structure, including diststyle and sort key, then populated it by a "select *" from the unsorted table with no "order by" clause.
The new table showed up as 0% unsorted in svv_table_info, i.e. it's apparently sorted.
How/why is that possible?
Looks like bulk inserts into an empty table automatically sort the incoming data using the destination table's sort key as they write it, so it lands in the sorted region. Found this in the docs for "deep copy":
"A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table."
https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
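The truncate-and-reload pattern described above can be sketched as follows (table names are assumptions); because the INSERT ... SELECT is a bulk insert into an empty table, Redshift sorts the incoming rows by the target's sort key on its own, so no ORDER BY is needed on the SELECT:

```sql
TRUNCATE target_table;

-- Bulk insert into the now-empty table; rows are sorted by target_table's
-- sort key automatically.
INSERT INTO target_table
SELECT * FROM source_table;

-- Verify: unsorted should report 0 (fully sorted).
SELECT "table", unsorted
FROM svv_table_info
WHERE "table" = 'target_table';
```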

How to Hash an Entire Redshift Table?

I want to hash entire Redshift tables in order to check for consistency after upgrades, backups, and other modifications that shouldn't affect table data.
I've found Hashing Tables to Ensure Consistency in Postgres, Redshift and MySQL, but that solution still requires spelling out each column name and type, so it can't be applied to new tables in a generic manner; I'd have to manually change the column names and types for each table.
Is there some other function or method by which I could hash / checksum entire tables in order to confirm they are identical? Ideally without spelling out the specific column and column types of that table.
There is certainly no in-built capability in Redshift to hash whole tables.
Also, I'd be a little careful of the method suggested in that article because, from what I can see, it is calculating a hash of all the values in a column but isn't associating the hashed value with a row identifier. Therefore if Row 1 and Row 2 swapped values in a column, the hash wouldn't change. So, it's not strictly calculating an adequate hash (but I could be wrong!).
You could investigate using the new Stored Procedures in Redshift to see whether you can create a generic function that would work for any table.
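As a starting point, here is a hedged sketch of a per-table checksum that addresses the row-association concern above: it hashes each row as a whole, then combines the row hashes with an order-independent SUM, so swapping values between rows does change the result. The table and column names are assumptions; a generic version would still need to generate the column list per table (e.g. from the catalog, via a stored procedure):

```sql
-- MD5 each full row (with a delimiter to avoid ambiguous concatenations),
-- take the first 8 hex digits as an integer, and sum across rows.
SELECT SUM(STRTOL(SUBSTRING(
           MD5(COALESCE(CAST(id AS VARCHAR), '')         || '|' ||
               COALESCE(CAST(name AS VARCHAR), '')       || '|' ||
               COALESCE(CAST(updated_at AS VARCHAR), '')), 1, 8), 16)) AS table_hash
FROM my_table;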

Compound Sort Key vs. Sort Key

Let me ask another question about the Redshift sort key.
We're planning to set the sortkey with the columns frequently used in WHERE statement.
So far, the best combination for our system seems to be:
DISTSTYLE EVEN + COMPOUND SORTKEY + COMPRESSED Column (except for First SortKey column)
Just wondering which would be better, a simple SORTKEY or a COMPOUND SORTKEY, for our BI tables, which can face diverse queries depending on users' analyses.
For example, we set the compound sortkey according to frequency in several queries' WHERE statement as follows.
COMPOUND SORTKEY
(
    PURCHASE_DATE,  <-- set as first sort key since it's a date column
    STORE_ID,
    CUSTOMER_ID,
    PRODUCT_ID
)
But sometimes actual queries filter only on PRODUCT_ID, without the other listed sort key columns, or in an order different from the compound key order.
In that case, is the COMPOUND SORTKEY useless, or would a simple SORTKEY be more effective?
I'd be so grateful if you would tell me about your idea and experiences.
The simple rules for Amazon Redshift are:
Use DISTKEY on the column that is most frequently used with JOIN
Use SORTKEY on the column(s) that is most frequently used with WHERE
You are correct that the above compound sort key would only be used if PURCHASE_DATE is included in the WHERE.
An alternative is to use Interleaved Sort Keys, which give equal weighting to many columns and can be used where different fields are often used in the WHERE. However, Interleaved Sort Keys are much slower to VACUUM and are rarely worth using.
So, aim to use SORTKEY on most of your queries, but don't worry too much about the other queries unless you are having some particular performance problems.
See: Redshift Sort Keys - Choosing Best Sort Style | Hevo Blog
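For reference, the two options discussed above differ only in the sort key clause of the DDL. A hypothetical sketch (table and column definitions are assumptions based on the question's key):

```sql
CREATE TABLE sales_compound (
    purchase_date DATE,
    store_id      INT,
    customer_id   INT,
    product_id    INT,
    amount        DECIMAL(12,2)
)
DISTSTYLE EVEN
COMPOUND SORTKEY (purchase_date, store_id, customer_id, product_id);

-- Interleaved: equal weight to every listed column, but much slower to
-- VACUUM REINDEX; rarely worth using, per the answer above.
CREATE TABLE sales_interleaved (
    purchase_date DATE,
    store_id      INT,
    customer_id   INT,
    product_id    INT,
    amount        DECIMAL(12,2)
)
DISTSTYLE EVEN
INTERLEAVED SORTKEY (purchase_date, store_id, customer_id, product_id);
```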
Your compound sort key looks sensible to me. It's important to understand that Redshift sort keys are not an index which is used or not used. The sort key is used to physically arrange the data on disk.
The query optimizer "uses" the sort key by looking at the "zone map" (min and max values) for each block during query execution. This happens for all columns regardless of whether they are in the sort key.
Secondary columns in a compound sort key can still be very effective at reducing the data that has to be scanned from disk, especially when the column values are low cardinality.
See this previous example for a query to check on sort key effectiveness: Is my sort key being used?
Please review our guide for designing tables effectively: "Amazon Redshift Engineering’s Advanced Table Design Playbook". The guide discusses the correct use of Interleaved sort keys but note that they should only be used in very specific circumstances.
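One way to check sort key effectiveness directly, as a hedged sketch: after running your query, inspect SVL_QUERY_SUMMARY. An `is_rrscan` of `t` on a scan step means the scan was range-restricted (the zone maps pruned blocks), and comparing `rows_pre_filter` with `rows` shows how much data the predicate discarded after being read:

```sql
SELECT query, seg, step, rows, rows_pre_filter, is_rrscan
FROM svl_query_summary
WHERE query = PG_LAST_QUERY_ID()
  AND label LIKE 'scan%';
```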

Athena: Minimize data scanned by query including JOIN operation

Let there be an external table in Athena which points to a large amount of data stored in parquet format on s3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (small one) which maps timeid to date.
When the smaller table is also partitioned on timeid and we join them on their partition id (timeid) and put date into where clause, only those specific records are scanned from large table which contain timeids corresponding to that date. The entire data is not scanned here.
However, if the smaller table is not partitioned on timeid, full data scan takes place even in the presence of condition on date column.
Is there a way to avoid a full data scan even when the large partitioned table is joined with an unpartitioned small table? This matters because the small table contains only one record per timeid, so creating a separate file per timeid partition may not be reasonable.
That's an interesting discovery!
You might be able to avoid the large scan by using a sub-query instead of a join.
Instead of:
SELECT ...
FROM large_table
JOIN small_table ON small_table.timeid = large_table.timeid
WHERE small_table.date > '2017-08-03'
you might be able to use:
SELECT ...
FROM large_table
WHERE large_table.timeid IN
(SELECT timeid FROM small_table
 WHERE date > '2017-08-03')
I haven't tested it, but that would avoid the JOIN you mention.