Prevent duplicates on insert with billions of rows in SQL Data Warehouse? - azure-sqldw

I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause is that the data comes from a third party that sends supposedly unique data but sometimes sends duplicates, and the rows have no identifying key, so unfortunately I have no way of knowing whether we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other columns
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- other columns
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- other columns
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added statistics and even created a secondary index on RowHash, but inserts of any real size (10 million rows, for example) fail with transaction-size errors. I've also tried batches of 50,000, and those are simply too slow.

Two things I can think of that would avoid the singleton lookups in your query:
Outer join your staging table with the fact table and filter on NULL values (sketched below). Assuming you're using a clustered columnstore index on your fact table, this should be a lot cheaper than the above.
Do a CTAS that takes a SELECT DISTINCT from the existing fact table and a SELECT DISTINCT from the staging table, combined with a UNION.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
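A minimal sketch of the first option, reusing the table and column names from the question; the LEFT JOIN plus IS NULL filter plays the role of the NOT EXISTS:
INSERT INTO [SomeFact]
(
    EmailAddress,
    EventDate,
    -- other columns
    RowHash
)
SELECT
    temp.EmailAddress,
    temp.EventDate,
    -- other columns
    temp.RowHash
FROM #StagingTable temp
LEFT OUTER JOIN [SomeFact] f
    ON f.RowHash = temp.RowHash
WHERE f.RowHash IS NULL;   -- keep only staging rows with no matching hash in the fact table
The second option would instead be a single CTAS whose SELECT is the UNION of a SELECT DISTINCT from [SomeFact] and a SELECT DISTINCT from #StagingTable.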

Can you partition the 'main' table by EventDate? Assuming new data has a recent EventDate, CTAS out only the partitions that cover the EventDates of the new data, then 'merge' the old and new data with a CTAS / UNION into a table with the same partition scheme (the UNION will remove the duplicates), or use the INSERT method you developed against that much smaller table. Finally, switch the partition(s) back into the 'main' table.
Note - There is a new option on the partition swap command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".
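A hedged sketch of that pattern, assuming monthly partitions on EventDate, that [SomeFact] has matching partition boundaries, and that all the new data lands in a single partition (the boundary values and partition numbers are illustrative, and the IDENTITY column is omitted for brevity):
-- Rebuild the affected partition together with the new rows; UNION removes exact duplicates
CREATE TABLE [SomeFact_Switch]
WITH
(
    DISTRIBUTION = HASH(RowHash),
    PARTITION (EventDate RANGE RIGHT FOR VALUES ('2018-02-01', '2018-03-01'))
)
AS
SELECT EventDate, EmailAddress, RowHash
FROM [SomeFact]
WHERE EventDate >= '2018-02-01' AND EventDate < '2018-03-01'
UNION
SELECT EventDate, EmailAddress, RowHash
FROM #StagingTable;

-- Partition 2 covers [2018-02-01, 2018-03-01); TRUNCATE_TARGET empties the target partition as part of the switch
ALTER TABLE [SomeFact_Switch] SWITCH PARTITION 2 TO [SomeFact] PARTITION 2 WITH (TRUNCATE_TARGET = ON);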

Related

What is best practice to deal with missing data according to Kimball?

I have a database with the following tables:
Customers, Invoices, Salesman, Target.
The ones relevant to my question are Customers and Invoices.
There are customer IDs used in Invoices that don't exist in the Customers table.
If I used only the customers from the Customers table, my customer dimension would be incomplete.
My solution is to append these IDs from Invoices to Customers and fill the other columns in the Customers table with nulls.
I don't know if this is the best approach according to Kimball.
Also, if it is a good solution, how can I accomplish it with Power BI Desktop?
Customers table: (generated sample data)
Invoices table: (just a sample; the real table has thousands of rows)
There are two points here:
Firstly, (in import mode at least) PBI already creates the "blank row" for items present in your fact table but missing from your dimension table for precisely this scenario. If you don't need the granularity of each individual missing customer id, then you don't need to do anything.
Secondly, if you need to retain that granularity then your approach is the correct one. The way to do this in Power Query is as follows:
Create a new query which takes your customer dimension table and does a left outer join on customer id with your invoice fact table.
Expand the newly joined table but retain only the new customer id column.
Remove all columns apart from the new customer id column.
Remove duplicates
You now have a list of missing customer ids. Ensure the column name is the same as the column name of your customer id in the customer dimension table. Append this to the original customer dimension query and the nulls will be filled in automatically for the missing columns.
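If the source data also lives in a relational database, the same logic can be sketched in SQL (the table and column names here are assumptions based on the question):
-- Append customer ids that appear in Invoices but not in Customers;
-- all other Customer columns are left NULL, as in the Power Query approach.
INSERT INTO Customers (CustomerID)
SELECT DISTINCT i.CustomerID
FROM Invoices AS i
LEFT JOIN Customers AS c
    ON c.CustomerID = i.CustomerID
WHERE c.CustomerID IS NULL;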
Please keep in mind that it is Kimball, not Kimble.
There are four steps in the DWH methodology:
1) Understand the business process (what is your process actually measuring?)
2) Decide the grain (what does each row in your fact table actually represent?)
3) Decide the dimensions (ask who, what, where, when, how, how many and how much of the grain declaration you formed from the business process)
4) Define the facts (metrics)
Following this order, you define dimension tables before building your fact tables. If your dimension table (the Customer table in this case) is missing customers that exist in your fact table, my biggest advice, according to DWH dimensional modelling, is to set your Customer table right: define every customer in your dimension table, then populate your fact table with records:
[Customer ID] in Customer Table : PRIMARY KEY
[CustomerID] in Invoice Table : FOREIGN KEY
SQL and Power BI react very differently to your problem:
1) Power BI has no referential integrity concept: it adds a blank row to your dimension table in such a case.
2) SQL gives a referential integrity error and won't even let you add the rows to your fact table. Personally, I side with SQL here.
Finally, use an ETL tool (SSIS, Talend, ODI or even Power Query) to make your dimension table as accurate as possible.
For example:
Do not leave any column value as NULL.
If a date is unknown, put in a default date value like '1900-12-31'.
If a textual property is unknown, put in a keyword such as 'unknown' or 'not available'.
Dimension tables are the main targets of querying in SQL statements, and different SQL vendors (SQL Server, Oracle, MySQL) handle NULL values differently, which can hurt performance.
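A minimal SQL sketch of that kind of default-value cleansing (the column names are purely illustrative):
-- Replace NULLs in the dimension with explicit 'unknown' members
UPDATE Customers
SET CustomerName   = COALESCE(CustomerName, 'Unknown'),
    City           = COALESCE(City, 'Not available'),
    FirstOrderDate = COALESCE(FirstOrderDate, '1900-12-31');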

Google Spanner - How do you copy data to another table?

Since Spanner does not have a DML statement like
insert into dest select * from source_table
how do we select a subset of a table and copy those rows into another table?
I am trying to write data to a temporary table and then move it to an archive table at the end of the day. The only solution I have found so far is to select rows from the source table and write them to the new table. This is done using the Java API, which has no ResultSet to Mutation converter, so I need to map every column of the table to the new table even though they are exactly the same.
Another issue is updating just one column; there is no way of doing "update table_name set column = column - 1".
Again, to do that I need to read the row and map every field into an update Mutation, which is not practical if I have many tables; I would need to write code for each of them. A ResultSet -> Mutation converter would be nice here too.
Is there any generic Mutation cloner and/or any other way to copy data between tables?
As of version 0.15 this open source JDBC Driver supports bulk INSERT-statements that can be used to copy data from one table to another. The INSERT-syntax can also be used to perform bulk UPDATEs on data.
Bulk insert example:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT C1, C2, C3
FROM OTHER_TABLE
WHERE C1>1000
Bulk update is done using an INSERT-statement with the addition of ON DUPLICATE KEY UPDATE. You have to include the value of the primary key in your insert statement in order to 'force' a key violation which in turn will ensure that the existing rows will be updated:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT COL1, COL2+1, COL3+COL2
FROM TABLE
WHERE COL2<1000
ON DUPLICATE KEY UPDATE
You can use the JDBC driver with for example SQuirreL to test it, or to do ad-hoc data manipulation.
Please note that the underlying limitations of Cloud Spanner still apply, meaning a maximum of 20,000 mutations in one transaction. The JDBC Driver can work around this limit by specifying the value AllowExtendedMode=true in your connection string or in the connection properties. When this mode is allowed, and you issue a bulk INSERT- or UPDATE-statement that will exceed the limits of one transaction, the driver will automatically open an extra connection and perform the bulk operation in batches on the new connection. This means that the bulk operation will NOT be performed atomically, and will be committed automatically after each successful batch, but at least it will be done automatically for you.
Have a look here for some more examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
Another approach to performing the bulk copy is to use LIMIT & OFFSET:
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000 OFFSET 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000 OFFSET 2000);
.
.
.
and so on, until you have covered all the required rows.
PS: this is more of a trick, but it will definitely save you time.
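One caveat: without an ORDER BY, consecutive LIMIT/OFFSET batches are not guaranteed to carve the table up consistently. A slightly safer variant of the same trick, assuming the source table has a key column (here called id, which is an assumption), orders each batch by that key:
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table order by id limit 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table order by id limit 1000 offset 1000);
-- and so on, increasing the offset by 1000 each time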
Spanner supports expressions in the SET section of an UPDATE statement, which can be used to supply a subquery that fetches data from another table, like this:
UPDATE target_table
SET target_field = (
-- use subquery as an expression (must return a single row)
SELECT source_table.source_field
FROM source_table
WHERE my_condition IS TRUE
) WHERE my_other_condition IS TRUE;
The generic syntax is:
UPDATE table SET column_name = { expression | DEFAULT } WHERE condition

Athena: Minimize data scanned by query including JOIN operation

Let there be an external table in Athena which points to a large amount of data stored in Parquet format on S3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (a small one) which maps timeid to date.
When the smaller table is also partitioned on timeid, and we join the tables on their partition key (timeid) and put the date into the WHERE clause, only the records from the large table whose timeids correspond to that date are scanned. The entire table is not scanned.
However, if the smaller table is not partitioned on timeid, a full scan of the large table takes place even with the condition on the date column.
Is there a way to avoid the full scan even when the large partitioned table is joined with an unpartitioned small table? This matters because the small table contains only one record per timeid, so creating a separate partition for each might not be worthwhile.
That's an interesting discovery!
You might be able to avoid the large scan by using a sub-query instead of a join.
Instead of:
SELECT ...
FROM large_table
JOIN small_table
  ON large_table.timeid = small_table.timeid
WHERE small_table.date > '2017-08-03'
you might be able to use:
SELECT ...
FROM large_table
WHERE large_table.timeid IN
  (SELECT timeid FROM small_table
   WHERE date > '2017-08-03')
I haven't tested it, but that would avoid the JOIN you mention.

Should I use a column both as distkey and sortkey

I have a table in Redshift containing a billion records (log file entries). It has a timestamp column ts which is both the distkey and the sortkey. The following query:
select ts from apilogs where date(ts) = '2016-09-08' limit 10;
runs super fast when I query an old date, but not for the latest date! Not sure why! Any help is appreciated.
How I load the logs: I loaded all the old log files into this table in one shot, while incremental log files are loaded hourly.
When I checked the detailed plan in the AWS console, I could see that the slow query scans all billion rows, while the query taking a few milliseconds scans only a few thousand rows (i.e. the rows corresponding to that date).
So the question is: why is it scanning the whole table for the latest timestamp?
Dist key and sort key can be on the same column, no problem.
Was your latest data load into the log table sorted according to the sort key? If not, you will have to run VACUUM on your log table so that the sort key column is actually sorted and Redshift does not have to scan unnecessary rows.
Run the below query to check if you have any unsorted region in your table.
select trim(pgdb.datname) as "database",
       trim(a.name) as "table",
       ((b.mbytes/part.total::decimal)*100)::decimal(5,2) as pct_of_total,
       b.mbytes,
       b.unsorted_mbytes,
       (unsorted_mbytes/mbytes::decimal)*100 as unsorted_pct
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes
      from stv_blocklist group by tbl) b on a.id = b.tbl
join (select sum(capacity) as total
      from stv_partitions where part_begin = 0) as part on 1 = 1
where a.slice = 0 and a.name in ('apilogs')
order by 3 desc, db_id, name;
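Alternatively, on reasonably recent Redshift releases, SVV_TABLE_INFO exposes the unsorted percentage directly, which is a simpler check:
-- unsorted = percent of rows sitting in the unsorted region of the table
select "table", unsorted, stats_off
from svv_table_info
where "table" = 'apilogs';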
If you have an unsorted region, run:
Vacuum apilogs to 100 percent
It seems like you haven't run VACUUM on your table after adding the rows with the latest timestamps.
Here's the part, most relevant to your use case, from the Redshift documentation:
When data is initially loaded into a table that has a sort key, the data is sorted according to the SORTKEY specification in the CREATE TABLE statement. However, when you update the table, using COPY, INSERT, or UPDATE statements, new rows are stored in a separate unsorted region on disk, then sorted on demand for queries as required. If large numbers of rows remain unsorted on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. The VACUUM command merges new rows with existing sorted rows, so range-restricted scans are more efficient and the execution engine doesn't need to sort rows on demand during query execution.
P.S. You shouldn't be worried about your distribution key here, as it only comes into the picture during joins.

Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
AND col2 = 12;
The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets these criteria, so the result set is just one row.
UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
AND col2 = 12;
This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.
I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
Any idea what is going on here?
You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a two-step process:
Each row that will be changed must first be marked for deletion
Then a new version of the data must be written for each column in the table
If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.
You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.
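A hedged sketch of that deep-copy approach, reusing the column and table names from the question (the dist/sort key choices are assumptions; match them to your real table, and include every remaining column in the SELECT):
-- Build a new, fully sorted copy of the table with the "update" applied during the copy
CREATE TABLE sometable_new
DISTKEY (col1)
SORTKEY (col1, col2)
AS
SELECT col1,
       col2,
       CASE WHEN col1 = 'a value of col1' AND col2 = 12 THEN 20 ELSE col3 END AS col3
       -- ...plus all the other columns of SOMETABLE
FROM sometable;

-- Swap the new table in
ALTER TABLE sometable RENAME TO sometable_old;
ALTER TABLE sometable_new RENAME TO sometable;
DROP TABLE sometable_old;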
Actually, I don't think Redshift is designed for bulk updates; Redshift is designed for OLAP rather than OLTP, so update operations are inefficient by nature.
In this use case, I would suggest doing an INSERT instead of an UPDATE and adding a TIMESTAMP column. When you run your analysis on Redshift, you'll need extra logic to pick the latest TIMESTAMP so that you eliminate possible duplicate entries.
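A minimal sketch of that insert-only pattern, assuming an extra updated_at column and that (col1, col2) identifies a logical row (both are assumptions):
-- Write a new version of the row instead of updating in place
INSERT INTO sometable (col1, col2, col3, updated_at)
VALUES ('a value of col1', 12, 20, GETDATE());

-- At query time, keep only the latest version of each logical row
SELECT col1, col2, col3
FROM (
    SELECT col1, col2, col3,
           ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY updated_at DESC) AS rn
    FROM sometable
) latest
WHERE rn = 1;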