Why doesn't Snowflake support CTE scope (any workaround?)

I'm a Business Intelligence (BI) consultant and I'm running into an issue where Snowflake doesn't support CTE scope.
In BI, it's incredibly useful to reuse and redefine bits of SQL. However, if I define a CTE called revenue_calculations, then change something in the WHERE clause and re-declare revenue_calculations as a new CTE further down in the script (or nested within another CTE declaration), Snowflake resolves revenue_calculations only once and uses the first CTE declaration throughout the script.
Most other databases (BigQuery, for example) and programming languages scope objects to their enclosing block. Is there any workaround to this? Will this be changing?
***Updated to include code sample
with cte_in_question as (select 1),
cte2 as (
    with cte_in_question as (select 2)
    select * from cte_in_question
)
select * from cte2;
Snowflake evaluates this to 1 and BQ to 2. 2 seems much more correct to me. Thoughts?

It turns out that in Snowflake, by default, the data from the outer CTE is returned. This behaviour can be altered, though: you need to contact Snowflake support and ask them to change it at your account level so that the data from the inner CTE is returned instead.
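If changing the account-level behaviour isn't an option, a simpler workaround is to avoid the shadowing entirely by giving each redefinition its own name. A minimal sketch based on the example above (the names are illustrative):
with cte_in_question_v1 as (select 1),
cte2 as (
    with cte_in_question_v2 as (select 2)
    select * from cte_in_question_v2
)
select * from cte2;
Because no name is declared twice, this returns 2 on both Snowflake and BigQuery.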

Related

CREATE TABLE using WITH clause in redshift not working [duplicate]

I need to create an empty time table series for a report so I can left join activity from several tables to it. Every hour of the day does not necessarily have data, but I want it to show null or zero for inactivity instead of omitting that hour of the day.
In later versions of Postgres (post 8.0.2), this is easy in several ways:
SELECT unnest(array[0,1,2,3,4...]) as numbers
OR
CROSS JOIN (select generate_series as hours
from generate_series(now()::timestamp,
now()::timestamp + interval '1 day',
'1 hour'::interval
)) date_series
Redshift can run some of these commands, but throws an error when you attempt to run them in conjunction with any of the tables.
WHAT I NEED:
A reliable way to generate a series of numbers (e.g. 0-23) as a subquery that will run on Redshift (which is based on PostgreSQL 8.0.2).
As long as you have a table that has more rows than your required series has numbers, this is what has worked for me in the past:
select (row_number() over (order by 1)) - 1 as hour
from large_table
limit 24;
Which returns numbers 0-23.
Unfortunately, Amazon Redshift does not support generate_series() as a table function. The workaround seems to be creating a table of numbers.
See also:
Using sql function generate_series() in redshift
Generate Series in Redshift and MySQL, which does not seem correct but does introduce some interesting ideas
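If you go the numbers-table route, a hedged sketch is to build the table once from any sufficiently large existing table (the source table name below is a placeholder) and then filter it wherever a series is needed:
create table numbers as
select n
from (
    select (row_number() over (order by 1)) - 1 as n
    from some_large_table
) t
where n < 10000;

-- hours of the day are then just a filter
select n as hour from numbers where n < 24 order by n;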
Recursion was released for Redshift in April 2021. Now that recursive CTEs are supported, you can generate a series of numbers (or even a date table) with the code below:
with recursive numbers(NUMBER) as (
    select 1
    union all
    select NUMBER + 1 from numbers where NUMBER < 28
)
select NUMBER from numbers;
I'm not a big fan of querying a system table just to get a list of row numbers. If it's something constant and small enough like hours of a day, I would go with plain old UNION ALL:
WITH
hours_in_day AS (
SELECT 0 AS hour
UNION ALL SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
...
UNION ALL SELECT 23
)
And then join hours_in_day to whatever you want to.
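For the original reporting need, the last step is the left join against the activity data. A hedged sketch that continues the WITH block above, assuming a hypothetical activity table with an event_time timestamp column:
select h.hour,
       count(a.event_time) as events   -- 0 for hours with no activity
from hours_in_day h
left join activity a
       on date_part(hour, a.event_time) = h.hour
group by h.hour
order by h.hour;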

Redshift Spectrum - Referencing an external table in a CTE?

I'm trying to make some data available via Redshift Spectrum to our reporting platform. I chose Spectrum because it offers lower latency to our data lake vs a batched ETL process.
One of the queries I have looks like this
with txns as (select * from spectrum_table where ...)
select field1, field2, ...
from txns t1
left join txns t2 on t2.id = t1.id
left join txns t3 on t3.id = t1.id
where...
Intuitively, I expected the CTE to cache the Spectrum query output in memory and make it available to the rest of the query without hitting S3 a second (or third) time.
However, I checked the explain plan, and with each join the number of "S3 Seq Scan"s goes up by one. So it appears to do the Spectrum scan each time the CTE is queried.
Questions:
Is this actually happening? Or is the explain plan wrong? The run-time of this query doesn't appear to increase linearly with the number of joins, so it's hard to tell.
If it is happening, what other options are there to achieve this sort of result? Other than manually creating a temp table (this will be accessed by a reporting tool, so I'd prefer to avoid allowing explicit write access or requiring multiple statements to get the data)
Thanks!
Yes, this is really happening. CTE references are not reused; each reference is planned and scanned separately, because different data may be needed at different references, and applying WHERE clauses at the table scan is an important performance feature.
You could look into using a materialized view, but I expect that you are dynamically applying the WHERE clauses in the CTE, so this may not match your need. If it were me, I'd want to understand the reason for the triple self-join; it feels like there may be a better way to construct the query, but that is just a gut feel.
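If the filter in the CTE is stable enough, one option along those lines is to materialize the Spectrum scan once and point the reporting tool at that. A rough sketch, assuming a Spectrum external schema and placeholder column names (materialized views over external tables generally need to be refreshed explicitly):
create materialized view txns_mv as
select id, field1, field2
from spectrum_schema.spectrum_table
where load_date >= '2023-01-01';

-- the self-joins then read the local, materialized copy instead of S3
select t1.field1, t2.field2
from txns_mv t1
left join txns_mv t2 on t2.id = t1.id;

refresh materialized view txns_mv;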

Is there any way to do SELECT * EXCEPT (col1, col2, ...) ... in Redshift?

In BigQuery I can write:
SELECT * EXCEPT (col1, col2, ...) ...
Is there an equivalent for RedShift? I don't think there is, but I wanted to see if anyone had any bright ideas.
Incidentally, I find this to be very useful in BigQuery when writing multiple subqueries, each flowing into the next. I can include/exclude columns at the relevant part of the query without having it break something later on, which is very useful when developing a complex query.
Not to my knowledge.
The only EXCEPT Redshift has is the normal set-operator functionality that subtracts one relation from another.
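To make the distinction concrete, here is a small illustrative sketch with hypothetical table names. Redshift's EXCEPT operates on rows, not columns, so for column exclusion you still have to list the columns you want explicitly:
-- rows in one relation but not the other (Redshift's EXCEPT)
select customer_id from all_customers
except
select customer_id from churned_customers;

-- no BigQuery-style column exclusion; list the columns you keep instead
select col3, col4, col5
from my_table;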

Google Spanner - How do you copy data to another table?

Since Spanner does not have a DML feature like
insert into dest select * from source_table
how do we select a subset of a table and copy those rows into another table?
I am trying to write data to a temporary table and then move the data to an archive table at the end of the day. The only solution I have found so far is to select rows from the source table and write them to the new table. This is done using the Java API, which has no ResultSet-to-Mutation converter, so I need to map every column of the table to the new table, even when the tables are exactly the same.
Another issue is updating just one column, as there is no way of doing "update table_name set column = column - 1".
Again, to do that I need to read each row and map every field into an update Mutation, which is not practical when there are many tables; I would have to write that code for all of them. A ResultSet -> Mutation converter would be nice here too.
Is there any generic Mutation cloner and/or any other way to copy data between tables?
As of version 0.15, this open source JDBC driver supports bulk INSERT statements that can be used to copy data from one table to another. The INSERT syntax can also be used to perform bulk UPDATEs on data.
Bulk insert example:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT C1, C2, C3
FROM OTHER_TABLE
WHERE C1>1000
Bulk update is done using an INSERT-statement with the addition of ON DUPLICATE KEY UPDATE. You have to include the value of the primary key in your insert statement in order to 'force' a key violation which in turn will ensure that the existing rows will be updated:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT COL1, COL2+1, COL3+COL2
FROM TABLE
WHERE COL2<1000
ON DUPLICATE KEY UPDATE
You can use the JDBC driver with for example SQuirreL to test it, or to do ad-hoc data manipulation.
Please note that the underlying limitations of Cloud Spanner still apply, meaning a maximum of 20,000 mutations in one transaction. The JDBC Driver can work around this limit by specifying the value AllowExtendedMode=true in your connection string or in the connection properties. When this mode is allowed, and you issue a bulk INSERT- or UPDATE-statement that will exceed the limits of one transaction, the driver will automatically open an extra connection and perform the bulk operation in batches on the new connection. This means that the bulk operation will NOT be performed atomically, and will be committed automatically after each successful batch, but at least it will be done automatically for you.
Have a look here for some more examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
Another approach to performing the bulk copy is to batch the inserts using LIMIT & OFFSET (an ORDER BY on a key column keeps the batches deterministic; here c1 is assumed to be the key):
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table order by c1 limit 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table order by c1 limit 1000 offset 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table order by c1 limit 1000 offset 2000);
...
and so on, increasing the OFFSET by the batch size until all rows are copied.
PS: This is more of a trick, but it will definitely save you time.
Spanner supports expressions in the SET clause of an UPDATE statement, which can be used to supply a subquery that fetches data from another table, like this:
UPDATE target_table
SET target_field = (
-- use subquery as an expression (must return a single row)
SELECT source_table.source_field
FROM source_table
WHERE my_condition IS TRUE
) WHERE my_other_condition IS TRUE;
The generic syntax is:
UPDATE table SET column_name = { expression | DEFAULT } WHERE condition
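For the simple arithmetic case in the original question, this DML support covers it directly. A minimal sketch (Spanner requires a WHERE clause on UPDATE, so WHERE true is used to touch every row):
UPDATE table_name
SET column = column - 1
WHERE true;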

Reading (even joining) a very large (1.1bn row) table in Enterprise Guide from Teradata

Hopefully you guys can help with what I'm hoping is quite a simple question for those in the know!
I live (well, work) in SAS Enterprise Guide and am trying to perform a simple left join against a table in Teradata.
The table is extremely large (700+ columns, 1.1bn rows) and so far I have been connecting via a LIBNAME statement at the top of my program, followed by the usual PROC SQL to read the data.
The issue I am having is that it is extremely slow. I performed the join successfully using 90 rows in the left table and it took 3 hours to complete. The real table I want to use has something like 15,000 rows.
I have tried to connect via the SQL Pass-Through method, but this throws a hosts file error, which I can't fix due to corporate security limitations.
Has anyone had any experience performing this kind of task?
I should mention that I can run a simple select * query in Teradata SQL Assistant in just over 1 minute (16,666,666 obs/s!), so the limitation must be somewhere between SAS/Teradata, or even in SAS itself.
I'm sorry I haven't posted actual code snippets as they're on my work machine but this has been bugging me for ages so thought I'd see if I'm missing any tricks.
Thanks in advance for your help.
So you're joining a SAS data set to a Teradata table and want to return the matching records. You'll want to use SAS's DBMASTER= data set option. It designates which of the tables is larger. By telling SAS this, it knows which table to move.
Here I assume librefs have already been assigned and that the Teradata table is larger--more obs--than the SAS data set.
proc sql threads;
    select tdTable.*
    from sastables.sasTable1, td.tdTable (dbmaster=yes)
    where tdTable.idNum = sasTable1.idNum;
quit;
If by chance your SAS data set is larger, you'll want to use the MULTI_DATASRC_OPT= option. Either google these terms or look in the SAS/Access to Relational Databases manual. It's pretty good.
Good luck.
Have you considered creating a volatile table in Teradata? Since this is created in your spool allocation you shouldn't need explicit permissions to create the table. Once created you can load the SAS data set into the Volatile table and collect statistics on the table's join columns and filter columns. This will help the optimizer understand the demographics about your "small" table. The volatile table will only persist for the duration of your session and is not accessible across multiple sessions.
Then rewrite your SAS code to push-down the SQL to Teradata joining the large table to your volatile table. The results can be returned to SAS and loaded into another data set.
CREATE VOLATILE TABLE MyTable, NO FALLBACK
( ColA SMALLINT NOT NULL,
ColB VARCHAR(10) NOT NULL
) PRIMARY INDEX (ColA)
ON COMMIT PRESERVE ROWS /* This is important */
;
The primary index is how Teradata distributes and accesses the data. Tables distributed on the same column will join "AMP local" and will not require a redistribution. This is not always possible, as your primary index selection has to consider even distribution as well as access path. The primary index does not have to be unique, but it can be.
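Once the SAS data set has been loaded into the volatile table, the pushed-down Teradata side might look roughly like this (the large table name and join column are placeholders, and the COLLECT STATISTICS syntax varies slightly by Teradata version):
COLLECT STATISTICS COLUMN (ColA) ON MyTable;

SELECT small.ColA, big.*
FROM MyTable AS small
LEFT JOIN big_teradata_table AS big
    ON big.ColA = small.ColA;
The result set is what you would then pull back into a SAS data set.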
Hope this helps.