Redshift Spectrum - Referencing an external table in a CTE?

I'm trying to make some data available via Redshift Spectrum to our reporting platform. I chose Spectrum because it offers lower latency to our data lake vs a batched ETL process.
One of the queries I have looks like this:
with txns as (select * from spectrum_table where ...)
select field1, field2, ...
from txns t1
left join txns t2 on t2.id = t1.id
left join txns t3 on t3.id = t1.id
where...
Intuitively, this should cache the Spectrum query output in memory in the CTE and make it available to the later joins in the query without hitting S3 a second (or third) time.
However, I checked the explain plan, and with each join the number of "S3 Seq Scan"s goes up by one. So it appears to do the Spectrum scan each time the CTE is queried.
Questions:
Is this actually happening? Or is the explain plan wrong? The run-time of this query doesn't appear to increase linearly with the number of joins, so it's hard to tell.
If it is happening, what other options are there to achieve this sort of result, other than manually creating a temp table? (This will be accessed by a reporting tool, so I'd prefer to avoid allowing explicit write access or requiring multiple statements to get the data.)
Thanks!

Yes, this is really happening. CTE references are not reused - this is because different data may be needed at each reference, and applying WHERE clauses at the table scan is an important performance feature.
You could look into using a materialized view, but I expect that you are dynamically applying the WHERE clauses in the CTE, so this may not match your need. If it were me, I'd want to understand why the triple self-join is there. It seems like there may be a better way to construct the query, but that is just a gut feel.
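If the filters could be fixed up front, here's a rough sketch of the materialized view route (the schema, table, and column names below are just placeholders, and as I recall auto refresh isn't available for materialized views over external tables, so you'd refresh it on a schedule):
-- Placeholder names: spectrum_schema.spectrum_table, id, field1, field2, txn_date.
create materialized view txns_mv as
select id, field1, field2
from spectrum_schema.spectrum_table
where txn_date >= '2021-01-01';  -- the filter has to be static here, not passed in at query time
-- Refresh periodically so the reporting tool reads local data instead of S3.
refresh materialized view txns_mv;
The reporting tool could then select from txns_mv with no write access and a single statement.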

Related

Patterns for replicating data to BigQuery

I'm asking for the best practice/industry standard on these types of jobs. This is what I've been doing:
The end goal is to have a replication of the data in BigQuery
Get the data from an API (incrementally using the previous watermark, using a field like updated_at)
Batch load into a native BigQuery table (the main table)
Run an update-ish query, like this:
select * except (_rn)
from (
  select *, row_number() over (partition by <id> order by updated_at desc) as _rn
  from <main_table>
)
where _rn = 1
Essentially, only get the rows which are the most up-to-date. I'm opting for a table instead of a view to facilitate downstream usage.
This method works for small tables, but when the volume increases it runs into some issues:
The whole table is recreated, whether partitioned or not
If partitioned, I could easily run into quota limits
I've also looked at other methods, including loading into a staging table and then performing a merge operation between the two, something like the sketch below.
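(Table and column names here are placeholders - assuming id is the key and updated_at is the watermark field, with payload standing in for the remaining columns.)
-- Placeholder names: dataset.main_table, dataset.staging_table, id, updated_at, payload.
merge dataset.main_table t
using dataset.staging_table s
on t.id = s.id
when matched and s.updated_at > t.updated_at then
  update set updated_at = s.updated_at, payload = s.payload
when not matched then
  insert (id, updated_at, payload) values (s.id, s.updated_at, s.payload);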
So I'm asking for advice on what your preferred methods/patterns/architecture are to achieve the end goals.
Thanks

Is it always poor performance when using the entire table in DAX?

So I am reading that using the table name causes table expansion which results in slow performance.
I understand that this can happen when tablename is used in FILTER. However if I use the tablename in REMOVEFILTERS, then does this also cause table expansion/poor performance?
No. When you use REMOVEFILTERS, the DAX engine only marks the expanded table from which all the filters are to be removed, and that's all it does; this doesn't have any performance impact.
Referencing table names in basic calculations is fine. The problem arises when you nest iterators with complex code in the row context; if the cardinality of each iterator is also fairly high, then in most cases the DAX engine will materialize the whole table in memory as a data cache, which the Formula Engine later iterates over.
Huge materialized tables are a big problem because the VertiPaq architecture is designed to reduce RAM usage, but materialized tables are uncompressed tables in memory and can consume a lot of RAM.
If you want to create a materialization yourself for testing, try using SAMPLE or GROUPBY as a top-level function and use DAX Studio to confirm the materialization.
My advice is to always use a single column whenever you can; this allows for faster scanning of the column/dictionary, whereas more columns require scanning each column and then joining them.

Amazon Athena scans lots of data when query involves only partitions

I have a table on Athena partitioned by day (huge table, TB of data). There's no day column on the table, at least not explicitly. I would expect that a query like the following:
select max(day) from my_table
would scan virtually no data. However, Athena reports that several hundreds of GB are scanned. Any idea why?
===== EDIT 2021-01-14 =====
I've recently bumped into this issue again. It turns out that when the underlying data is Parquet, operations on partitions don't scan data. For other data formats that I've tried (including ORC) there is an associated scan cost. It doesn't make any sense to me.
I don't know the answer for a fact, but my guess is:
Athena just does not have the optimization of reading only the partition names when only they are queried - this is clear from its behaviour - so it scans everything.
Parquet stores min/max values for every column, whereas ORC does so only if an index is present, as far as I understand. Thus for Parquet, Athena's query optimizer can look directly at these rollup values, i.e., no scan is performed. It's different for ORC.
I know it is a little late to answer this question for you, Nicolas, but it is important to keep some possible solutions here as well.
Unfortunately, this is the way Athena works: Athena will read all the data as a table scan just to list the partition values.
A possible workaround that works perfectly here is to use the partition metadata instead of the data itself. For example:
Instead of using this syntax:
select max(day) from my_table
Try to use this syntax:
SELECT day FROM my_schema."my_table$partitions" ORDER BY day DESC LIMIT 1
This second statement reads just metadata information and returns the same value you need.
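Alternatively, if you only need to list the partition values rather than aggregate them in SQL, SHOW PARTITIONS also reads only metadata; note that it returns strings such as day=2021-01-14, so you would pick the latest value on the client side:
-- Reads only partition metadata (no data scan); returns rows like day=2021-01-14
SHOW PARTITIONS my_table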
It does not depend on the format but on the compression algorithm used - mostly Snappy for ORC and GZIP for Parquet. This is what makes the difference.

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1,000,000 records to only the subset I need (+/- 500) by creating an Inner Join to a different table. Unfortunately, it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with the related table. I have no access to the database itself, otherwise I would have written the SQL myself. Is there a way to make merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query - right-click on its last step and check if View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.

Are Redshift system tables immutable and well ordered?

Redshift system tables only store a few days of logging data - periodically backing up rows from these tables is a common practice to collect and maintain a proper history. To find new rows added to the system logs, I need to check against my backup tables on either query (number) or execution time.
According to an answer on How do I keep more than 5 day's worth of query logs? we can simply select all rows with query > (select max(query) from log). The answer is unreferenced and assumes that query is inserted sequentially.
My question in two parts - hoping for references or code-as-proof - is
are query (identifiers) expected to be inserted sequentially, and
are system tables, e.g. stl_query, immutable or unchanging?
Assuming that we can't verify or prove both the above, then what's the right strategy to backup the system tables?
I am wary of this because I fully expect long running queries to complete after many other queries have started and completed.
I know query (identifier) is generated at query submit time, because I can monitor in-progress queries. Therefore it is completely expected that a long-running query=1 may complete after query=2. If the stl_query table is immutable then query=1 will be inserted after query=2, and the max(query) logic is flawed.
Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc). This would require me to do an upsert into the backup table, something like the sketch below.
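(Illustrative only - using admin.query_history as a placeholder name for the backup table. Redshift has no single upsert statement, so this is the usual delete-then-insert pattern, and it would only be needed if stl_query rows were actually updated in place after insertion.)
begin;
-- Remove any history rows that correspond to queries still present in stl_query.
delete from admin.query_history
using stl_query
where admin.query_history.query = stl_query.query;
-- Re-insert the current (possibly updated) versions of those rows.
insert into admin.query_history
select * from stl_query;
commit;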
I think the stl_query table is indeed immutable; it would seem that it's only written to after a query finishes.
Here is why I think that. First off, I ran this query on a cluster with running queries:
select count(*) from stl_query where endtime is null
This returns 0. My hunch is that you'll probably see the same thing on your side.
To be doubly sure, I also ran this query:
select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query
This also returns zero (while I did have queries inflight), which seems to confirm that queries are only logged in stl_query when they have finished executing and are not updated.
That said, I would rewrite the query that inserts into your history table as follows:
insert into admin.query_history (
select * from stl_query
where query not in (select query from admin.query_history)
)
That way, you'll always insert any records you don't have in the history table.