Is there any way to do SELECT * EXCEPT (col1, col2, ...) in Redshift? - amazon-web-services

In BigQuery I can write:
SELECT * EXCEPT (col1, col2, ...) ...
Is there an equivalent for Redshift? I don't think there is, but I wanted to see if anyone had any bright ideas.
Incidentally, I find this very useful in BigQuery when writing multiple subqueries, each flowing into the next: I can include or exclude columns at the relevant part of the query without breaking something later on, which is very helpful when developing a complex query.

Not to my knowledge.
The only EXCEPT Redshift supports is the normal SELECT set operator, which subtracts one relation's rows from another's.
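As a workaround, you can have the catalog generate the explicit column list for you and paste the result into your query. A sketch, assuming a table named my_table and excluded columns col1 and col2 (all names here are illustrative):

```sql
-- Build a comma-separated list of every column in my_table except col1 and col2.
-- Paste the result into your SELECT in place of *.
SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY ordinal_position)
FROM information_schema.columns
WHERE table_name = 'my_table'
  AND column_name NOT IN ('col1', 'col2');
```

This isn't a single-statement equivalent of SELECT * EXCEPT, but it removes the tedium of typing long column lists while developing a complex query.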

Related

Does the syntax in a Power BI join cause a data refresh?

I'm trying to make a Power BI report that someone else created run faster, and as I go through the queries I've noticed that some of the merged queries use different syntax. I'm wondering whether the different syntax causes a data refresh to occur during the merge.
Below are two merged queries; one has the # sign before the table name, with the name in quotes, and the other does not. What is the significance of not having the # sign?
It's #"Org_Roll-Up" vs Account_Groups.
Syntax 1
= Table.NestedJoin(#"Changed Type9", {"COMPANY"}, #"Org_Roll-Up", {"ORG"}, "Org_Roll-Up", JoinKind.LeftOuter)
Syntax 2
= Table.NestedJoin(#"Removed Columns", {"ACCOUNT"}, Account_Groups, {"ACCOUNT"}, "Account_Groups", JoinKind.LeftOuter)
I'm trying to get the queries to run once and then send the data to other queries as needed, instead of refreshing each time. I have parallel loading turned off and background data refresh off.
The # syntax makes zero difference. Variable names require a # when they contain spaces or special characters; otherwise it's optional. See here for more details.
https://bengribaudo.com/blog/2018/01/19/4321/power-query-m-primer-part4-variables-identifiers
The way you’ve worded this, I just want to make sure you understand how Power Query performs a merge. When you merge query A with query B, Power Query re-runs query B for use by query A. It doesn't pull previously loaded data from table B in the data model (nor does it change table B); it re-evaluates query B, joins the result with query A, and loads that into table A. So in syntax 1, #"Org_Roll-Up" will re-run, and in syntax 2, Account_Groups will re-run. Depending on what the query does and how many rows the table has, small changes can produce quite different performance. See Chris Webb’s three-part series for more ideas: https://blog.crossjoin.co.uk/2020/05/31/optimising-the-performance-of-power-query-merges-in-power-bi-part-1/

Redshift Spectrum - Referencing an external table in a CTE?

I'm trying to make some data available via Redshift Spectrum to our reporting platform. I chose Spectrum because it offers lower latency to our data lake vs a batched ETL process.
One of the queries I have looks like this
with txns as (select * from spectrum_table where ...)
select field1, field2, ...
from txns t1
left join txns t2 on t2.id = t1.id
left join txns t3 on t3.id = t1.id
where...
Intuitively, this should cache the Spectrum query output in memory via the CTE and make it available later in the query without hitting S3 a second (or third) time.
However, I checked the explain plan, and with each join the number of "S3 Seq Scan"s goes up by one. So it appears to do the Spectrum scan each time the CTE is queried.
Questions:
Is this actually happening? Or is the explain plan wrong? The run-time of this query doesn't appear to increase linearly with the number of joins, so it's hard to tell.
If it is happening, what other options are there to achieve this sort of result, other than manually creating a temp table? (This will be accessed by a reporting tool, so I'd prefer to avoid granting explicit write access or requiring multiple statements to get the data.)
Thanks!
Yes, this is really happening. CTE references are not reused; this is due to the possibility that different data will be needed at the different references (for example, different predicates pushed down to each scan), and applying WHERE clauses at the table scan is an important performance feature.
You could look into using a materialized view, but I expect you're applying the WHERE clauses in the CTE dynamically, so this may not match your need. If it were me, I'd want to understand why the triple self-join; it seems like there may be a better way to construct the query, but that's just a gut feeling.
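If the filters are static enough, the materialized-view route would look roughly like the sketch below, so the self-joins hit a local, pre-scanned copy instead of S3. All object and column names are illustrative, and note that Redshift materialized views over external (Spectrum) tables carry restrictions such as no auto-refresh:

```sql
-- Pre-compute the Spectrum scan once into local storage.
CREATE MATERIALIZED VIEW txns_mv AS
SELECT id, field1, field2
FROM spectrum_schema.spectrum_table;

-- Reporting queries then join the local copy, not S3:
SELECT t1.field1, t2.field2
FROM txns_mv t1
LEFT JOIN txns_mv t2 ON t2.id = t1.id;
```

You would refresh it with REFRESH MATERIALIZED VIEW txns_mv; on whatever schedule matches your latency needs, which keeps the reporting tool read-only.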

Why doesn't Snowflake support CTE scope (any workaround?)

I'm a Business Intelligence (BI) consultant, and I'm running into an issue where Snowflake doesn't support CTE scope.
In BI, it's incredibly useful to redefine bits of SQL. However, if I define a CTE called revenue_calculations, then change something in the WHERE clause and re-declare revenue_calculations as a new CTE further down the script (or nested within another CTE declaration), Snowflake only reads revenue_calculations once and uses the first declaration throughout the script.
Most other databases (BigQuery, for example) and programming languages have scope for objects. Is there any workaround to this? Will this be changing?
***Updated to include code sample
with cte_in_question as (select 1),
cte2 as (
with cte_in_question as (select 2)
select * from cte_in_question
)
SELECT * FROM cte2
Snowflake evaluates this to 1 and BigQuery to 2. 2 seems much more correct to me. Thoughts?
It turns out that in Snowflake, by default, the data from the outer CTE is returned. This behaviour can be altered, but you need to contact Snowflake support and request that they change it (at your account level) so that the data from the inner CTE is returned.
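Unless and until the account-level behaviour is changed, a portable workaround is simply to avoid shadowing by giving the inner CTE a distinct name, which reads the same on both engines:

```sql
with cte_outer as (select 1 as v),
cte2 as (
    with cte_inner as (select 2 as v)
    select * from cte_inner
)
select * from cte2;  -- returns 2 on both Snowflake and BigQuery
```

Renaming removes the ambiguity entirely, so the query's meaning no longer depends on how the engine resolves nested CTE scope.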

Athena equivalent to information_schema

For background, I come from a SQLServer background and make heavy use of the system tables & information_schema, to tell me all about my tables and columns.
I didn't expect exactly the same power in Athena, but I'm currently quite shocked and frustrated by how little seems to be available, unless I've missed something?
For example, 'describe mytable' describes just one table at a time.
How about showing the columns for ALL tables in one result?
It also does not output the table name, nor allow you to add it manually as a custom column.
All of these "show/list/describe" commands seem to produce a text list rather than a recordset, so you cannot take the results and join them to other tables or views to build more complex outputs.
Is there any other way to query the contents of my databases ?
Thanks in advance
Athena is based on Presto. Presto provides an information_schema schema, and I checked that it is accessible in Athena.
You can run e.g. a query like:
SELECT * FROM information_schema.columns;
to get a list of columns of all tables.
You can filter this by "database":
SELECT * FROM information_schema.columns WHERE table_schema = '<databasename>';
Note however that these types of queries are not necessarily very performant.
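Because information_schema is a real relational source, you can also aggregate and join it like any other table, which addresses the "not a recordset" complaint. For example, to list every table with its column count in one result set:

```sql
SELECT table_schema, table_name, COUNT(*) AS column_count
FROM information_schema.columns
GROUP BY table_schema, table_name
ORDER BY table_schema, table_name;
```

The same table can be joined to your own views to build more complex metadata reports.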

Applying Left Join in Pentaho

I'm trying to create a transformation that merges two databases using a Merge Join step, based on queries like those below, and I'm a little confused about what I should fill in for First Step and Second Step to Lookup for each query format.
Query Format :
SELECT * FROM A a LEFT JOIN B b on a.value=b.value
SELECT * FROM A a LEFT JOIN B b on b.value=a.value
There are various ways to do it.
Write the SQL with the join in a Table input step. A quick-and-dirty solution if your tables are in the same database, but don't tell a PDI expert you did it that way.
If you know there is only one B record for each A record, use a Stream Lookup step. Very, very, very efficient. The main flow is A, and the lookup step is B.
If you have many B records for each A record, use a Join Rows step. Don't be afraid: you do not really make a Cartesian product, as you can add the condition a.value=b.value.
In the same situation, you can also use a Merge Join. The first step is the step you write first in the SQL SELECT statement.
There are multiple ways to do this.
You can use a Table input step and simply write your query there. No need to do anything else to implement the above query.