Should I use a column both as distkey and sortkey - amazon-web-services

I have a table in Redshift containing a billion records (log file entries). It has a timestamp column ts, which is both the distkey and the sortkey. The following query:
select ts from apilogs where date(ts) = '2016-09-08' limit 10;
runs very fast when I query for an old date, but not for the latest date, and I'm not sure why. Any help is appreciated.
How I load the logs: I loaded all the old log files into this table in one shot, and I load each incremental log file hourly.
When I checked the detailed plan on the AWS console, I could see that the slow query scans all billion rows, while the query that takes a few milliseconds scans only a few thousand rows (i.e. the rows corresponding to that date).
So the question is: why does it scan the whole table for the latest timestamp?

The dist key and sort key can be on the same column. No problem!
Was your latest data load into the log table sorted according to the sort key? If not, you will have to run VACUUM on the log table so that the rows are stored in sort key order and Redshift does not have to scan unnecessary rows.
Run the query below to check whether your table has an unsorted region.
select trim(pgdb.datname) as Database,
  trim(a.name) as Table,
  ((b.mbytes/part.total::decimal)*100)::decimal(5,2) as pct_of_total,
  b.mbytes,
  b.unsorted_mbytes,
  (unsorted_mbytes/mbytes::decimal)*100 as unsorted_pct
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes
      from stv_blocklist group by tbl) b on a.id = b.tbl
join (select sum(capacity) as total
      from stv_partitions where part_begin = 0) as part on 1 = 1
where a.slice = 0 and a.name in ('apilogs')
order by 3 desc, db_id, name;
If there is an unsorted region, run:
VACUUM apilogs TO 100 PERCENT;

It seems you haven't run VACUUM on the table since you added the rows for the latest timestamps.
Here is the part of the Redshift documentation most relevant to your use case:
When data is initially loaded into a table that has a sort key, the data is sorted according to the SORTKEY specification in the CREATE TABLE statement. However, when you update the table, using COPY, INSERT, or UPDATE statements, new rows are stored in a separate unsorted region on disk, then sorted on demand for queries as required. If large numbers of rows remain unsorted on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. The VACUUM command merges new rows with existing sorted rows, so range-restricted scans are more efficient and the execution engine doesn't need to sort rows on demand during query execution.
P.S. You shouldn't need to worry about your distribution key here, as it comes into play mainly during joins.
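To make the documentation excerpt above more concrete, here is a toy Python model of block-level min/max pruning. It is only a sketch under stated assumptions, not Redshift's actual implementation: the block size, the integer "timestamps", and the helper names (`Block`, `to_blocks`, `blocks_scanned`) are all hypothetical. It compares how many blocks a range-restricted scan has to read when newer rows sit in an unsorted region versus after everything has been re-sorted, which is what VACUUM does.

```python
# Toy model of block-level min/max pruning. Not Redshift internals.
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 4  # rows per block; hypothetical and tiny for illustration


@dataclass
class Block:
    rows: List[int]

    @property
    def lo(self) -> int:
        return min(self.rows)

    @property
    def hi(self) -> int:
        return max(self.rows)


def to_blocks(values: List[int]) -> List[Block]:
    """Split values into fixed-size blocks in storage order."""
    return [Block(values[i:i + BLOCK_SIZE]) for i in range(0, len(values), BLOCK_SIZE)]


def blocks_scanned(blocks: List[Block], lo: int, hi: int) -> int:
    """Count blocks whose [min, max] range overlaps the queried range."""
    return sum(1 for b in blocks if b.lo <= hi and b.hi >= lo)


# Initial bulk load: timestamps 0..39, stored in sort order.
sorted_region = list(range(40))
# Later appends: timestamps 40..79 arriving interleaved, so each block spans a wide range.
unsorted_region = [40 + i + 10 * j for i in range(10) for j in range(4)]

before_vacuum = to_blocks(sorted_region) + to_blocks(unsorted_region)
after_vacuum = to_blocks(sorted(sorted_region + unsorted_region))

old_scan = blocks_scanned(before_vacuum, 8, 11)        # range inside the sorted region
recent_scan = blocks_scanned(before_vacuum, 60, 63)    # range inside the unsorted region
recent_scan_sorted = blocks_scanned(after_vacuum, 60, 63)
print(old_scan, recent_scan, recent_scan_sorted)
```

In this toy setup the old range touches 1 block, the recent range touches all 10 appended blocks before sorting, and 1 block afterwards. Real tables will behave less starkly, since hourly loads are usually at least roughly time-ordered.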

Related

Is there a good way to import unsorted data into QuestDB in bulk?

I have a 100GB data file with sensor readings spanning a few weeks. The timestamps are not in strict chronological order and I'd like to bulk load the data into QuestDB. The order is not completely random, but some rows arrive up to three minutes late.
Is there an efficient way to do bulk loading like this and ensure that the data is ordered chronologically at the same time?
The most efficient way to do this is in three steps.
First, import the unordered dataset. You can do this via curl:
curl -F data=@unordered-data.csv 'http://localhost:9000/imp'
Next, create a table with the schema of the imported data and apply a partitioning strategy. The timestamp column can be cast to a timestamp if automatic detection of the timestamp failed:
CREATE TABLE ordered AS (
SELECT
cast(timestamp AS timestamp) timestamp,
col1,
col2
FROM 'unordered-data.csv' WHERE 1 != 1
) timestamp(timestamp) PARTITION BY DAY;
Finally, insert the unordered records into the partitioned table, providing a lag and a batch size:
INSERT batch 100000 lag 180000000 INTO ordered
SELECT
cast(timestamp AS timestamp) timestamp,
col1,
col2
FROM 'unordered-data.csv';
To confirm that the table is ordered, the isOrdered() function may be used:
select isOrdered(timestamp) from ordered
isOrdered
true
There is more information on loading data this way in the CSV import documentation.
lag is the expected lateness of records; in your case it can be about 3 minutes (180000000 microseconds in the statement above).
batch is the number of records to process in one batch.
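To illustrate what a lag setting buys you conceptually, here is a minimal Python sketch of reordering a stream with bounded lateness. It is not QuestDB's implementation; the function name `reorder_with_lag` and the sample arrival times (in seconds) are assumptions made for illustration. The idea is that if no row is ever more than `lag` behind the newest timestamp seen so far, a small buffer is enough to emit rows in chronological order.

```python
import heapq
from typing import Iterable, Iterator


def reorder_with_lag(timestamps: Iterable[int], lag: int) -> Iterator[int]:
    """Yield timestamps in ascending order, assuming no row is more than `lag` late."""
    pending = []      # min-heap of rows that could still be preceded by a late row
    max_seen = None   # newest timestamp observed so far
    for ts in timestamps:
        heapq.heappush(pending, ts)
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # Rows at or below max_seen - lag can no longer be overtaken by a late arrival.
        while pending and pending[0] <= max_seen - lag:
            yield heapq.heappop(pending)
    # End of input: flush whatever is still buffered, in order.
    while pending:
        yield heapq.heappop(pending)


# Hypothetical arrival order in seconds; every row is at most 180 seconds late.
arrivals = [0, 60, 30, 240, 120, 300, 200, 500]
ordered = list(reorder_with_lag(arrivals, lag=180))
print(ordered)
```

If the lateness bound is violated, the output of this sketch is no longer guaranteed to be sorted, which is why the lag should be at least the expected lateness of the data.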

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data (stored as Parquet files in partitions) in S3.
One of the things we want to find out is how adding additional columns to the WHERE clause of a query affects its run time.
For example, we have 10 million records in one Hive partition (partitioned on the column 'date').
All the queries below return the same volume, 10 million records. Would all these queries take the same time, or does adding columns to the WHERE clause reduce the run time (since Parquet is a columnar format)?
I tried to test this, but the results were not consistent, I guess because there was some queuing time as well.
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet files contain page "indexes" (min/max statistics and bloom filters). If you sort the data by the columns in question during insert, for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
type,
subtype,
dt
from source_table -- hypothetical name; the original omitted the FROM clause
distribute by dt
sort by type, subtype
then these indexes can work efficiently: rows with the same type and subtype are stored in the same pages, and the indexes are used to select only the relevant data pages. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Also switch on predicate pushdown: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html

Prevent duplicates on insert with billions of rows in SQL Data Warehouse?

I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause of needing this is that the source of the data is a third party that sends over supposedly unique data, but sometimes sends duplicates which have no identifying key. I unfortunately have no idea if we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other rows
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other rows
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other rows
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added some statistics and even created a secondary index on RowHash, but inserts of any real size (10 million rows, for example) fail with errors due to transaction size. I've also tried batches of 50,000, and those are simply too slow as well.
Two things I can think of that would avoid the singleton lookups in your query:
Outer join your staging table with the fact table and filter on NULL values. Assuming you're using a clustered columnstore index on your fact table, this should be a lot less expensive than the above.
Do a CTAS that combines a SELECT DISTINCT from the existing fact table and a SELECT DISTINCT from the staging table with a UNION.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
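The outer-join variant can be sketched as follows. This runs against an in-memory SQLite database purely so the example is self-contained; it says nothing about performance on Azure SQL DW, and the lower-cased table and column names and the sample rows are hypothetical stand-ins for the ones in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE some_fact (email TEXT, event_date TEXT, row_hash TEXT);
    CREATE TABLE staging   (email TEXT, event_date TEXT, row_hash TEXT);
    INSERT INTO some_fact VALUES ('a@example.com', '2020-01-01', 'h1');
    INSERT INTO some_fact VALUES ('b@example.com', '2020-01-02', 'h2');
    -- Staging batch: one row already loaded, one new row sent twice.
    INSERT INTO staging VALUES ('b@example.com', '2020-01-02', 'h2');
    INSERT INTO staging VALUES ('c@example.com', '2020-01-03', 'h3');
    INSERT INTO staging VALUES ('c@example.com', '2020-01-03', 'h3');
""")

# Anti-join: keep only staging rows with no matching hash in the fact table.
# DISTINCT also collapses duplicates inside the batch itself.
conn.execute("""
    INSERT INTO some_fact (email, event_date, row_hash)
    SELECT DISTINCT s.email, s.event_date, s.row_hash
    FROM staging AS s
    LEFT JOIN some_fact AS f ON f.row_hash = s.row_hash
    WHERE f.row_hash IS NULL
""")

hashes = [row[0] for row in conn.execute("SELECT row_hash FROM some_fact ORDER BY row_hash")]
print(hashes)
```

The set-based join lets the engine match the whole batch in one pass instead of probing the fact table once per staging row, which is the motivation behind the first suggestion.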
Can you partition the 'main' table by EventDate? Assuming new data has a recent EventDate, you could CTAS out only the partitions that include the EventDates of the new data. Then either 'merge' the data with a CTAS / UNION of the 'old' and 'new' data into a table with the same partition schema (UNION will remove the duplicates), or use the INSERT method you developed against the smaller table. Finally, swap the partition(s) back into the 'main' table.
Note - There is a new option on the partition swap command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".

Athena: Minimize data scanned by query including JOIN operation

Let there be an external table in Athena which points to a large amount of data stored in parquet format on s3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (small one) which maps timeid to date.
When the smaller table is also partitioned on timeid and we join them on their partition id (timeid) and put date into where clause, only those specific records are scanned from large table which contain timeids corresponding to that date. The entire data is not scanned here.
However, if the smaller table is not partitioned on timeid, full data scan takes place even in the presence of condition on date column.
Is there a way to avoid a full data scan even when the large partitioned table is joined with an unpartitioned small table? This is required because the small table contains only one record per timeid, and creating a separate file for each would not be reasonable.
That's an interesting discovery!
You might be able to avoid the large scan by using a sub-query instead of a join.
Instead of:
SELECT ...
FROM large-table
JOIN small-table ON large-table.timeid = small-table.timeid
WHERE small-table.date > '2017-08-03'
you might be able to use:
SELECT ...
FROM large-table
WHERE large-table.timeid IN
(SELECT timeid FROM small-table
WHERE date > '2017-08-03')
I haven't tested it, but that would avoid the JOIN you mention.

Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
AND col2 = 12;
The above query returns in less than a second, because I have sortkeys on col1 and col2. Only one row meets these criteria, so the result set is just one row. However, if I run:
UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
AND col2 = 12;
This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.
I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
Any idea what is going on here?
You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a two-step process:
First, each row that will be changed must be marked for deletion.
Then a new version of the data must be written for each column in the table.
If you have a large number of columns and/or are updating a large number of rows, this process can be very labor intensive for the database.
You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.
Actually, I don't think Redshift is designed for bulk updates. Redshift is designed for OLAP rather than OLTP, so update operations are inherently inefficient.
For this use case, I would suggest doing an INSERT instead of an UPDATE and adding a TIMESTAMP column. When you run analysis on Redshift, you'll then need extra logic to pick the row with the latest TIMESTAMP and eliminate duplicate entries.
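The "extra logic" for the insert-only approach might look like the sketch below. It uses an in-memory SQLite database only to keep the example runnable; the `updated_at` column name and the sample rows are assumptions, and the query shape (join each row to the newest timestamp per key) is one option among several.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sometable (col1 TEXT, col2 INTEGER, col3 INTEGER, updated_at TEXT);
    INSERT INTO sometable VALUES ('a value of col1', 12, 10, '2017-01-01 00:00:00');
    INSERT INTO sometable VALUES ('another value',    7,  5, '2017-01-01 00:00:00');
    -- Instead of UPDATE ... SET col3 = 20, append a newer version of the row.
    INSERT INTO sometable VALUES ('a value of col1', 12, 20, '2017-01-02 00:00:00');
""")

# For each (col1, col2) key, keep only the row carrying the newest updated_at.
latest_rows = conn.execute("""
    SELECT t.col1, t.col2, t.col3
    FROM sometable AS t
    JOIN (SELECT col1, col2, MAX(updated_at) AS updated_at
          FROM sometable
          GROUP BY col1, col2) AS latest
      ON  latest.col1 = t.col1
      AND latest.col2 = t.col2
      AND latest.updated_at = t.updated_at
    ORDER BY t.col1
""").fetchall()
print(latest_rows)
```

The trade-off is that every analytical query (or a view on top of the table) has to apply this filter, and superseded rows keep taking up space until they are cleaned out.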