Azure SQL DW CTAS of over 102,400 rows to one distribution doesn't automatically compress - azure-sqldw

I thought the way columnstores worked was that if you bulk load over 102,400 rows into one distribution of a columnstore, it would automatically compress it. I'm not observing that in Azure SQL DW.
I'm doing the following CTAS statement:
create table ColumnstoreDemoCTAS
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 102401 cast(1 as int) as Column1, f.*
from FactInternetSales f
cross join sys.objects o1
cross join sys.objects o2
Now I check the status of the columnstore row groups:
select t.name
,NI.distribution_id
,CSRowGroups.state_description
,CSRowGroups.total_rows
,CSRowGroups.deleted_rows
FROM sys.tables AS t
JOIN sys.indexes AS i
ON t.object_id = i.object_id
JOIN sys.pdw_index_mappings AS IndexMap
ON i.object_id = IndexMap.object_id
AND i.index_id = IndexMap.index_id
JOIN sys.pdw_nodes_indexes AS NI
ON IndexMap.physical_name = NI.name
AND IndexMap.index_id = NI.index_id
LEFT JOIN sys.pdw_nodes_column_store_row_groups AS CSRowGroups
ON CSRowGroups.object_id = NI.object_id
AND CSRowGroups.pdw_node_id = NI.pdw_node_id
AND CSRowGroups.distribution_id = NI.distribution_id
AND CSRowGroups.index_id = NI.index_id
WHERE t.name = 'ColumnstoreDemoCTAS'
ORDER BY 1,2,3,4 desc;
I end up with one OPEN row group with 102,401 rows. Did I misunderstand this behavior of columnstores? Is Azure SQL DW different?
I see the same behavior if I do a bulk insert from SSIS of the same number of rows, all in one buffer.
I tried Drew's suggestion of inserting over 6.5 million rows and I still end up with all OPEN row groups:
create table ColumnstoreDemoWide
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 7000000 ROW_NUMBER() OVER (ORDER BY f.ProductKey) as Column1, f.*
from FactInternetSales f
cross join sys.objects o
cross join sys.objects o2
cross join sys.objects o3

Placing your data in a clustered columnstore will not decrease the number of rows returned. Instead, it compresses the stored data so that it takes up less space on disk. That means less data is moved for queries and you are charged less for storage, but your results stay the same. That being said, your data is currently residing in a deltastore, so you will not see any compression yet. Because of SQL DW's architecture, the data is split into a number of distributions under the covers. This makes it easier to parallelize computation and scale out, but it also means that each distribution has its own columnstore/deltastore, so you will need to load more rows to get the compression benefits.
In addition to the distribution structure, there is a difference in thresholds between SQL Server and SQL Data Warehouse. For SQL DW the threshold was 1,048,576 rows until a defect was resolved, as #JRJ describes. Now Azure SQL DW's threshold is 102,400 rows, like the rest of the SQL family. Once the number of rows you load into a single distribution exceeds this, you should see those rows compressed.
You can find a bit more information on loading into a columnstore here: https://msdn.microsoft.com/en-US/library/dn935008.aspx
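If you want the rows already sitting in the deltastore compressed without loading more data, one option is to rebuild the clustered columnstore index, which closes the open row groups and compresses them. This is only a hedged sketch against the demo table from the question:
-- Rebuild the clustered columnstore index; OPEN delta row groups are
-- compressed into columnstore segments as part of the rebuild.
ALTER INDEX ALL ON ColumnstoreDemoCTAS REBUILD;
Re-running the row group query from the question should then report COMPRESSED row groups instead of OPEN ones.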

This was a defect in the service. The fix is currently being rolled out. If you try this out on Japan West for example you will see that the behaviour is as you would expect.

Related

Prevent duplicates on insert with billions of rows in SQL Data Warehouse?

I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause of needing this is that the source of the data is a third party that sends over supposedly unique data, but sometimes sends duplicates which have no identifying key. I unfortunately have no idea if we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other columns
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other columns
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other columns
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added some statistics and even created a secondary index on RowHash, but inserts of any real size (10 million rows, for example) won't complete without erroring due to transaction sizes. I've also tried batches of 50,000 and those too are simply too slow.
Two things I can think of that avoid the singleton record lookups in your query would be to:
Outer join your staging table with the fact table and filter on the NULL values (see the sketch after this list). Assuming you're using a clustered columnstore index on your fact table, this should be much less expensive than the above.
Do a CTAS with a SELECT DISTINCT from the existing fact table UNIONed with a SELECT DISTINCT from the staging table.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
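A minimal sketch of the first option, reusing the table and column names from the question; the LEFT JOIN anti-join replaces the row-by-row NOT EXISTS probe:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other columns
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other columns
temp.RowHash
FROM #StagingTable temp
LEFT JOIN [SomeFact] f
ON f.RowHash = temp.RowHash
WHERE f.RowHash IS NULL; -- keep only staging rows with no match in the fact table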
Can you partition the 'main' table by EventDate? Assuming new data has a recent EventDate, you could CTAS out only the partitions that cover the EventDates of the new data, then 'merge' the data with a CTAS / UNION of the 'old' and 'new' data into a table with the same partition scheme (UNION will remove the duplicates), or use the INSERT method you developed against that much smaller table. Finally, swap the partition(s) back into the 'main' table.
Note - there is a new option on the partition switch command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".
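A hedged sketch of that final swap, assuming a work table SomeFact_New built on the same schema and partition scheme as SomeFact, with partition 3 used purely as a placeholder:
-- Switch the rebuilt partition back into the main table in one step;
-- TRUNCATE_TARGET = ON empties the target partition as part of the switch.
ALTER TABLE SomeFact_New SWITCH PARTITION 3 TO SomeFact PARTITION 3
WITH (TRUNCATE_TARGET = ON);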

Redshift: Aggregate data on large number of dimensions is slow

I have an Amazon Redshift table with about 400M records and 100 columns: 80 dimensions and 20 metrics.
The table is distributed by one of the high-cardinality dimension columns and includes a couple of high-cardinality columns in the sort key.
A simple aggregate query:
Select dim1, dim2...dim60, sum(met1),...sum(met15)
From my table
Group by dim1...dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each) it is faster to group by dimension keys only, and if you really need all dimension attributes join the aggregated result back to dimension tables to get them, like this:
with
groups as (
select dim1_id,dim2_id,...,dim20_id,sum(met1),sum(met2)
from my_table
group by 1,2,...,20
)
select *
from groups
join dim1_table
using (dim1_id)
join dim2_table
using (dim2_id)
...
join dim20_table
using (dim20_id)
If you don't want to normalize your table and you like having all the information in a single row, it's fine to keep it as is, since in a column database unused columns won't slow your queries down. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by 12 attributes; all you need to do is group by city ID and then join the city's attributes, country region and country tables to the city ID of each group.
3) You can store the combination of dimension IDs, concatenated with some delimiter like '-', in a separate varchar column and use that as a sort key.
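A minimal sketch of option 3, with assumed table and column names and integer dimension IDs; Redshift's CREATE TABLE AS allows the sort key to be declared on the derived column:
-- Build a copy of the table with a concatenated dimension-key column
-- and use that column as the sort key.
CREATE TABLE my_table_sorted
SORTKEY (dim_combo)
AS
SELECT
dim1_id::varchar || '-' || dim2_id::varchar || '-' || dim3_id::varchar AS dim_combo,
t.*
FROM my_table t;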
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could compute the results nightly and store them in a summary table for later querying.
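A minimal sketch of that nightly pre-aggregation fallback, with assumed table and column names:
-- Rebuild a pre-aggregated summary table on a nightly schedule;
-- reporting queries then read the much smaller summary instead of all 400M rows.
DROP TABLE IF EXISTS my_table_summary;
CREATE TABLE my_table_summary AS
SELECT
dim1, dim2, dim3, -- only the dimensions the reports actually need
SUM(met1) AS met1_total,
SUM(met2) AS met2_total
FROM my_table
GROUP BY dim1, dim2, dim3;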

Should I use a column both as distkey and sortkey

I have a table in Redshift containing a billion records (log file entries). It has a timestamp column ts, which is both the distkey and the sortkey. The following query:
select ts from apilogs where date(ts) = '2016-09-08' limit 10;
runs super fast when I query for an old date, but not for the latest date. Not sure why. Any help is appreciated.
How I load the logs: I loaded all the old log files into this table in one shot, while incremental log files are loaded hourly.
When I checked the detailed plan in the AWS console, I can see that the slow query scans all billion rows, while the query that takes a few milliseconds scans only a few thousand rows (i.e. the rows corresponding to that date).
So the question is: why is it scanning the whole table for the latest timestamp?
Dist key and sort key can be on the same column. No problem!
Was your latest data load into the log table sorted according to the sort key? If not, you will have to run VACUUM on the log table so that the sort key column stays in sorted order and Redshift does not have to scan unnecessary rows.
Run the below query to check if you have any unsorted region in your table.
select trim(pgdb.datname) as Database,
trim(a.name) as Table, ((b.mbytes/part.total::decimal)*100)::decimal(5,2) as pct_of_total, b.mbytes, b.unsorted_mbytes, (unsorted_mbytes/mbytes::decimal)*100 as unsorted_pct
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where a.slice=0 and a.name in ('apilogs')
order by 3 desc, db_id, name;
If you have an unsorted region, run:
VACUUM apilogs TO 100 PERCENT;
It seems like you haven't run VACUUM on your table after adding the rows for the latest timestamp.
Here's the part, most relevant to your use case, from the Redshift documentation:
When data is initially loaded into a table that has a sort key, the data is sorted according to the SORTKEY specification in the CREATE TABLE statement. However, when you update the table, using COPY, INSERT, or UPDATE statements, new rows are stored in a separate unsorted region on disk, then sorted on demand for queries as required. If large numbers of rows remain unsorted on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. The VACUUM command merges new rows with existing sorted rows, so range-restricted scans are more efficient and the execution engine doesn't need to sort rows on demand during query execution.
P.S. You shouldn't be worried about your distribution key here, as it only comes into the picture during joins.

Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
AND col2 = 12;
The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets these criteria, so the result set is just one row. However, if I run:
UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
AND col2 = 12;
This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.
I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
Any idea what is going on here?
You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a two-step process:
Each row that will be changed must first be marked for deletion
Then a new version of the data must be written for each column in the table
If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.
You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table, then dropping the existing table and renaming the new one. This has the added benefit of leaving you with a fully sorted table.
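A hedged sketch of that deep-copy approach, reusing the filter from the question and assuming the table only contains the three columns shown; the DISTKEY/SORTKEY clauses are assumptions and should match whatever the real table uses:
-- Write an "updated" copy of the table, then swap names.
CREATE TABLE sometable_new
DISTKEY (col1) -- assumption: keep the original distribution
SORTKEY (col1, col2) -- matches the sort keys mentioned in the question
AS
SELECT
col1,
col2,
CASE WHEN col1 = 'a value of col1' AND col2 = 12 THEN 20 ELSE col3 END AS col3
FROM sometable;
ALTER TABLE sometable RENAME TO sometable_old;
ALTER TABLE sometable_new RENAME TO sometable;
DROP TABLE sometable_old;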
Actually, I don't think Redshift is designed for bulk updates; Redshift is designed for OLAP rather than OLTP, so update operations are inefficient by nature.
In this use case, I would suggest doing an INSERT instead of an UPDATE and adding a TIMESTAMP column. When you do analysis on Redshift, you'll need extra logic to pick the latest TIMESTAMP so that possible duplicate data entries are eliminated.
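A minimal sketch of that read-side logic, assuming an added updated_at timestamp column and (col1, col2) as the logical key:
-- Keep only the most recent version of each (col1, col2) row.
SELECT col1, col2, col3
FROM (
SELECT col1, col2, col3,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY updated_at DESC) AS rn
FROM sometable
) latest
WHERE rn = 1;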

Creating pairwise combination of ids of a very large table in bigquery

I have a very large table of ids (string) that has 424,970 rows and only a single column.
I am trying to create the combination of those ids in a new table. The motivation for creation of that table can be found in this question.
I tried the following query to create the pairwise combination table:
#standardSQL
SELECT
t1.id AS id_1,
t2.id AS id_2
FROM
`project.dataset.id_vectors` t1
INNER JOIN
`project.dataset.id_vectors` t2
ON
t1.id < t2.id
But the query fails after 15 minutes, with the following error message:
Query exceeded resource limits. 602467.2409093559 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)
Is there any workaround to run the query and get the desired output table with all combination of ids?
You can try splitting your table T into 2 smaller tables T1 and T2, then performing 4 joins, one for each pair of the smaller tables (T1:T1, T1:T2, T2:T1, T2:T2), and unioning the results. This is equivalent to joining T with itself. If it still fails, try breaking it down into even smaller tables.
Alternatively, set maximumBillingTier to a higher value: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs
configuration.query.maximumBillingTier - Limits the billing tier for this job. Queries that have resource usage beyond this tier will fail (without incurring a charge). If unspecified, this will be set to your project default.
If using Java, it can be set in JobQueryConfiguration. This configuration property is not supported in the UI console at the moment.
In order to split the table you can use the FARM_FINGERPRINT function in BigQuery. E.g. the 1st part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) < 5
And the 2nd part will have the filter:
where mod(abs(farm_fingerprint(id)), 10) >= 5
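Putting the two ideas together, a hedged sketch of the split-and-union query; it assumes the two halves have already been materialized as project.dataset.id_vectors_1 and project.dataset.id_vectors_2 using the filters above. Because the t1.id < t2.id condition keeps each unordered pair exactly once, UNION ALL is safe:
#standardSQL
SELECT t1.id AS id_1, t2.id AS id_2
FROM `project.dataset.id_vectors_1` t1
JOIN `project.dataset.id_vectors_1` t2 ON t1.id < t2.id
UNION ALL
SELECT t1.id, t2.id
FROM `project.dataset.id_vectors_1` t1
JOIN `project.dataset.id_vectors_2` t2 ON t1.id < t2.id
UNION ALL
SELECT t1.id, t2.id
FROM `project.dataset.id_vectors_2` t1
JOIN `project.dataset.id_vectors_1` t2 ON t1.id < t2.id
UNION ALL
SELECT t1.id, t2.id
FROM `project.dataset.id_vectors_2` t1
JOIN `project.dataset.id_vectors_2` t2 ON t1.id < t2.id;
If a single job still exceeds the CPU limit, each of the four SELECTs can be run as a separate query and appended to the destination table.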