CTAS vs INSERT/SELECT to empty columnar table on Azure SQL Data Warehouse

I am running a series of tests to understand the throughput per DWU. I have eight (8) scenarios varying the ETL approach (CTAS vs INSERT/SELECT), varying the input table type (heap vs columnar), and varying the output table type (heap vs columnar).
Unexpectedly, with a columnar input table and a columnar output table, INSERT/SELECT and CTAS yielded the same throughput (8,100 rows per second per DWU).
Why would there not be some penalty associated with "full logging" of the INSERT/SELECT construct?
Givens:
DWU = 600
Source table: 17 columns, 1.33B rows
Target table empty beforehand
INSERT/SELECT Script:
CREATE TABLE etl_schema_name.fact_table_benchmark_testing
(
    column_1 INTEGER NOT NULL
    ,column_2 INTEGER NOT NULL
    ,column_3 SMALLINT NOT NULL
    ,column_4 SMALLINT NOT NULL
    ,column_5 INTEGER NOT NULL
    ,column_6 DECIMAL(9,4) NOT NULL
    ,column_7 DECIMAL(9,2) NOT NULL
    ,column_8 SMALLINT NOT NULL
    ,column_9 CHAR(1) NOT NULL
    ,column_10 SMALLINT NOT NULL
    ,column_11 DECIMAL(9,2) NOT NULL
    ,column_12 DECIMAL(9,2) NOT NULL
    ,column_13 DECIMAL(9,2) NOT NULL
    ,column_14 DECIMAL(9,2) NOT NULL
    ,column_15 DECIMAL(9,2) NOT NULL
    ,column_16 DECIMAL(9,2) NOT NULL
    ,column_17 DECIMAL(9,2) NOT NULL
)
WITH
(
    -- no index specified: SQL DW defaults to a clustered columnstore index
    DISTRIBUTION = HASH ( column_2 )
)
;
GO
INSERT INTO etl_schema_name.fact_table_benchmark_testing
(
    column_1
    ,column_2
    ,column_3
    ,column_4
    ,column_5
    ,column_6
    ,column_7
    ,column_8
    ,column_9
    ,column_10
    ,column_11
    ,column_12
    ,column_13
    ,column_14
    ,column_15
    ,column_16
    ,column_17
)
SELECT
    column_1
    ,column_2
    ,column_3
    ,column_4
    ,column_5
    ,column_6
    ,column_7
    ,column_8
    ,column_9
    ,column_10
    ,column_11
    ,column_12
    ,column_13
    ,column_14
    ,column_15
    ,column_16
    ,column_17
FROM production_schema_name.fact_table
;
GO
CTAS Script:
CREATE TABLE etl_schema_name.fact_table_benchmark_testing_2
WITH
(
    DISTRIBUTION = HASH ( column_2 )
)
AS
SELECT
    column_1
    ,column_2
    ,column_3
    ,column_4
    ,column_5
    ,column_6
    ,column_7
    ,column_8
    ,column_9
    ,column_10
    ,column_11
    ,column_12
    ,column_13
    ,column_14
    ,column_15
    ,column_16
    ,column_17
FROM production_schema_name.fact_table
;
GO
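One thing worth ruling out on both runs is hash skew on column_2; per-distribution row counts can be checked with DBCC PDW_SHOWSPACEUSED (shown here for the CTAS target):
DBCC PDW_SHOWSPACEUSED ( "etl_schema_name.fact_table_benchmark_testing_2" );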

INSERT...SELECT is not necessarily fully logged in SQL DW. Loads into an empty table like this one are eligible for minimal logging, which would explain why the two approaches show the same throughput. Have you had a chance to review the following article?
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-best-practices-transactions
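To verify, one option (a rough sketch; it assumes the node-level DMV exposes the same database_transaction_log_bytes_used column as the SQL Server DMV it mirrors) is to compare the transaction-log bytes written during each load:
-- run immediately after each load and compare the totals
SELECT pdw_node_id,
       SUM(database_transaction_log_bytes_used) AS log_bytes_used
FROM sys.dm_pdw_nodes_tran_database_transactions
GROUP BY pdw_node_id;
A minimally logged INSERT...SELECT should report a small fraction of the log bytes of a fully logged load.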

Related

In Presto, how to query a column which has no value

I have an integer column which is empty. How do I query that? I tried
select * from table where target_status_code = null
but it did not return anything.
Probably you should use is null to check the integer column which is empty; = null never matches because a comparison with null evaluates to unknown:
select * from table where target_status_code is null
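Presto also supports null-safe comparison via IS NOT DISTINCT FROM, so the same filter can be sketched as:
select * from table where target_status_code is not distinct from null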

Why does left join in Redshift not work?

We are facing a weird issue with Redshift and I am looking for help debugging it, please. Details of the issue follow:
I have 2 tables and I am trying to perform left join as follows:
select count(*)
from abc.orders ot
left outer join abc.events e on **ot.context_id = e.context_id**
where ot.order_id = '222:102'
The above query returns ~7000 records. It looks like it is performing a default join, as we have only 1 record in the [Orders] table with Order ID = '222:102'.
select count(*)
from abc.orders ot
left outer join abc.events e on **ot.event_id = e.event_id**
where ot.order_id = '222:102'
The above query returns 1 record, correctly. If you notice, I have just changed the column used for joining the 2 tables. Event_ID in the [Events] table is an identity column, but I thought I should get similar records even if I used any other column, like Context_ID.
Further, I tried the following query under the impression that it should return all ~7000 records, since I am using a default join, but surprisingly it returned only 1 record.
select count(*)
from abc.orders ot
**join** abc.events e on ot.event_id = e.event_id
where ot.order_id = '222:102'
Following are the Redshift database details:
Cut-down version of the table metadata:
CREATE TABLE abc.orders (
order_id character varying(30) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
event_id character varying(21) NOT NULL ENCODE zstd,
FOREIGN KEY (event_id) REFERENCES events_20191014(event_id)
)
DISTSTYLE EVEN
SORTKEY ( context_id, order_id );
CREATE TABLE abc.events (
event_id character varying(21) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
PRIMARY KEY (event_id)
)
DISTSTYLE ALL
SORTKEY ( context_id, event_id );
Database: Amazon Redshift cluster
I think I am missing something essential while joining the tables. Could you please guide me in the right direction?
Thank you
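A left join produces one output row for every pair of rows that satisfies the join condition, so if ~7000 rows in abc.events share that order's context_id, ~7000 result rows are exactly what a working left join returns. A quick diagnostic sketch against the tables above:
select count(*)
from abc.events e
where e.context_id = (
    select ot.context_id
    from abc.orders ot
    where ot.order_id = '222:102'
);
If this also returns ~7000, the join is fine: the context_id condition simply matches many events, while event_id (the primary key of abc.events) matches at most one.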

How to add a partition boundary only when not exists in SQL Data Warehouse?

I am using Azure SQL Data Warehouse Gen 1, and I created a partitioned table like this:
CREATE TABLE [dbo].[StatsPerBin1](
[Bin1] [varchar](100) NOT NULL,
[TimeWindow] [datetime] NOT NULL,
[Count] [int] NOT NULL,
[Timestamp] [datetime] NOT NULL)
WITH
(
DISTRIBUTION = HASH ( [Bin1] ),
CLUSTERED INDEX([Bin1]),
PARTITION
(
[TimeWindow] RANGE RIGHT FOR VALUES ()
)
)
How should I split a partition only when the boundary does not already exist?
First, I figured that if I can get the partition boundaries by table name, I can write an IF statement to decide whether or not to add the boundary.
But I cannot find a way to associate a table with its corresponding partition values. The partition values of all partitions can be retrieved with
SELECT * FROM sys.partition_range_values
but it only carries function_id as an identifier, and I don't know how to join that to other tables to get the partition boundaries by table name.
Have you tried joining sys.partition_range_values with the sys.partition_functions view?
Granted, we cannot create partition functions in SQL DW, but the view still seems to be supported.
I know this is an out-of-date question, but I was having the same problem. Here is a query I ended up with that can get you started. It is modified slightly from a query in the SQL Server documentation:
SELECT s.[name] AS [schema_name]
, t.[name] AS [table_name]
, p.[partition_number] AS [partition_number]
, rv.[value] AS [partition_boundary_value]
, p.[data_compression_desc] AS [partition_compression_desc]
FROM sys.schemas s
JOIN sys.tables t ON t.[schema_id] = s.[schema_id]
JOIN sys.partitions p ON p.[object_id] = t.[object_id]
JOIN sys.indexes i ON i.[object_id] = p.[object_id]
AND i.[index_id] = p.[index_id]
JOIN sys.data_spaces ds ON ds.[data_space_id] = i.[data_space_id]
LEFT JOIN sys.partition_schemes ps ON ps.[data_space_id] = ds.[data_space_id]
LEFT JOIN sys.partition_functions pf ON pf.[function_id] = ps.[function_id]
LEFT JOIN sys.partition_range_values rv ON rv.[function_id] = pf.[function_id]
AND rv.[boundary_id] = p.[partition_number]
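With that query, splitting only when needed becomes an existence check followed by ALTER TABLE ... SPLIT RANGE. A sketch for the table above (the boundary value is illustrative):
IF NOT EXISTS
(
    SELECT 1
    FROM sys.tables t
    JOIN sys.indexes i ON i.[object_id] = t.[object_id] AND i.[index_id] <= 1
    JOIN sys.partition_schemes ps ON ps.[data_space_id] = i.[data_space_id]
    JOIN sys.partition_range_values rv ON rv.[function_id] = ps.[function_id]
    WHERE t.[name] = 'StatsPerBin1'
      AND CAST(rv.[value] AS datetime) = '2019-01-01' -- hypothetical new boundary
)
    ALTER TABLE [dbo].[StatsPerBin1] SPLIT RANGE ('2019-01-01');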

Complex SQL syntax

I have a game, and in the database I'm saving the user actions by date & time.
CREATE TABLE user_actions
(
aId BIGSERIAL PRIMARY KEY NOT NULL,
userId BIGINT NOT NULL REFERENCES users(uId) DEFERRABLE INITIALLY DEFERRED,
aDate TIMESTAMP without time zone DEFAULT now(),
aType INTEGER NOT NULL DEFAULT 0
);
My users are identified by email:
CREATE TABLE users(
uId BIGSERIAL PRIMARY KEY NOT NULL,
uName VARCHAR (50) NOT NULL,
uEmail VARCHAR (75) UNIQUE NULL
);
New prizes are added each day, and each day has a different number of prizes:
CREATE TABLE prizes(
pId BIGSERIAL PRIMARY KEY NOT NULL,
pDate TIMESTAMP without time zone DEFAULT now(),
pType INTEGER NULL,
pSize INTEGER NULL
);
This query lists each userId and his last action date:
select userId, max(aDate) from user_actions group by userId order by userId;
I want to create a query that will count the number of prizes added since each user's last action.
I'm running:
OS: Debian
DB: Postgresql
code: Django
I think I will use a CTE, though it has not been tested:
WITH last_actions AS (
    SELECT userId, MAX(aDate) AS last_logged
    FROM user_actions
    GROUP BY userId
)
SELECT a.userId, COUNT(b.pDate) AS new_prizes
-- LEFT JOIN keeps users with no new prizes; COUNT(b.pDate) ignores the NULLs
FROM last_actions a
LEFT JOIN prizes b ON b.pDate >= a.last_logged
GROUP BY a.userId
ORDER BY a.userId;
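If the prizes table grows large, the range predicate on pDate benefits from a plain btree index (the index name is illustrative):
CREATE INDEX prizes_pdate_idx ON prizes (pDate);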

Can't create table (errno: 150) InnoDB adding foreign key constraints

I really hate to use other people's time, but it seems the problem is just not going away.
I considered all the recommendations at http://verysimple.com/2006/10/22/mysql-error-number-1005-cant-create-table-mydbsql-328_45frm-errno-150/ and at http://forums.mysql.com/read.php?22,19755,19755#msg-19755 but nothing worked.
I hope someone can point to a stupid mistake.
Here are the tables:
CREATE TABLE IF NOT EXISTS `shop`.`category` (
`id` INT(11) NOT NULL AUTO_INCREMENT ,
`category_id` INT(11) NOT NULL ,
`parent_id` INT(11) NULL DEFAULT '0' ,
`lang_id` INT(11) NOT NULL ,
...other columns...
PRIMARY KEY (`id`, `category_id`) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_unicode_ci;
CREATE TABLE IF NOT EXISTS `shop`.`product_category` (
`category_id` INT(11) NOT NULL ,
`product_id` INT(11) NOT NULL ,
INDEX `fk_product_category_category1_zxc` (`category_id` ASC) ,
CONSTRAINT `fk_product_category_category1_zxc`
FOREIGN KEY (`category_id` )
REFERENCES `shop`.`category` (`category_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_unicode_ci;
Error Code: 1005. Can't create table 'shop.product_category' (errno: 150)
You need an index on category_id in the category table (I see it's part of the primary key, but since it's the second column in that index, it cannot be used). The column you reference in a foreign key should always be indexed.
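A minimal sketch of that fix (the index name is illustrative): give category_id its own index, after which the foreign key constraint can be created.
ALTER TABLE `shop`.`category`
  ADD INDEX `idx_category_id` (`category_id`);
InnoDB only requires the referenced column to be the leftmost column of some index, so a standalone index on category_id satisfies it.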
In my case the issue was more like what was described in the first article you linked to.
So I just had to make sure that:
the referenced column is indexed,
both the referencing column and the referenced column share the same type and length, e.g. both are INT(10),
both share the same NOT NULL, UNSIGNED, ZEROFILL, etc. configuration,
both tables are InnoDB!
Here's the query template where Referencing Column is referencing_id and Referenced Column is referenced_id:
ALTER TABLE `db`.`referencing`
ADD CONSTRAINT `my_fk_idx`
FOREIGN KEY (`referencing_id`)
REFERENCES `db`.`referenced`(`referenced_id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION;
Update 2016-03-13: Ran into this problem again and ended up finding my own answer above. This time it didn't help, though: it turned out the other table was still set to MyISAM, and as soon as I changed it to InnoDB, everything worked.