I would like to know about BigQuery output destination tables - google-cloud-platform

I'm using BigQuery scheduled queries and would like to ask one question about writing the output of a SELECT statement into a separate table.
I am new to Google Cloud Platform.
If the query ends with an ORDER BY, is the result written to the output table in that order?
I would like the destination table of the scheduled query to be written after sorting with ORDER BY id. Is that possible?
Sorry for the rudimentary question, but thank you in advance.

BigQuery tables don't guarantee a certain order of their records unless you're using clustering, which physically sorts the data in the background. But even clustering does not guarantee a sorted output of a query: you have to use ORDER BY in the query that reads the table to have a guaranteed sorted output. That means you need a key by which you can sort.
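As a minimal sketch (the dataset, table, and column names are hypothetical), you could cluster the destination table by id so the data is kept physically sorted in storage, and still add an ORDER BY whenever you read it back, since only the ORDER BY guarantees the order:
-- clustered destination table written by the scheduled query (names are placeholders)
CREATE OR REPLACE TABLE my_dataset.scheduled_output
CLUSTER BY id
AS
SELECT id, payload
FROM my_dataset.source_table;
-- reading it back: the ORDER BY here is what guarantees the order, not the table itself
SELECT id, payload
FROM my_dataset.scheduled_output
ORDER BY id;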

Related

PowerBI Query Performance

I have a Power BI report that has a few different pages displaying different visuals. The report uses the same table of data (let's call it Jobs).
The previous author of this report created two queries in the data section that read off this base table, but apply different transformations and filters to the underlying data. The visuals then use either of these models to display their data. For example, the first one applies a filter to exclude certain columns based on a status field, and the other applies a different filter and performs transformations on some of the columns.
When I manually refresh the report, it looks like the report is retrieving data for both of these queries, even though the base data is the same. Since the dataset is quite large, I am worried that this report has been built inefficiently, but I am not sure if there is a better way of doing this.
TL;DR: The Source and Navigation steps of both queries are exactly the same - is this retrieving the data twice and making my report inefficient, and if so, what is the appropriate way to achieve what I am trying to do?
Power BI will try to parallelize as much as possible. If you have two queries that read from the same table, then two queries will be executed.
To avoid this you can:
Create a query which only gets the necessary data from the table.
Set this query not to be loaded in the model (toggle "Enable Load").
Have every other table that starts from this data reference this query rather than clone it.
In this way, the data will be fetched once from the source and then used to create the other tables using Power Query.

Amazon Athena scans lots of data when query involves only partitions

I have a table on Athena partitioned by day (huge table, TB of data). There's no day column on the table, at least not explicitly. I would expect that a query like the following:
select max(day) from my_table
would scan virtually no data. However, Athena reports that several hundreds of GB are scanned. Any idea why?
===== EDIT 2021-01-14 =====
I've recently bumped into this issue again. It turns out that when the underlying data is Parquet, operations on partitions don't consume data. For other data formats that I've tried (including ORC) there is an associated data cost. It doesn't make any sense to me.
I don't know the answer for a fact but I guesstimate:
Athena just does not have the optimization of looking only at the partition names when only they are queried; this is clear from its behaviour, so it scans everything.
Parquet has min/max statistics for every column, whereas ORC has them only if an index is present, AFAIU. Thus for Parquet, Athena's query optimizer looks directly at these rollup values and no scan is performed; it's different for ORC.
I know it's a little late to answer this question for you, Nicolas, but it is important to also keep some possible solutions here.
Unfortunately, this is the way Athena works: Athena will read all the data as a table scan just to list the partition values.
A possible workaround that works perfectly here is to use the partition metadata instead of the data itself, for example:
Instead of using this syntax:
select max(day) from my_table
Try to use this syntax:
SELECT day FROM my_schema."my_table$partitions" ORDER BY day DESC LIMIT 1
This second statement reads just metadata information and returns the same data you need.
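The same metadata table can answer other partition-only questions without scanning the data; as a hedged sketch reusing the same hypothetical schema and table name:
SELECT min(day) AS first_day, max(day) AS last_day, count(*) AS partition_count
FROM my_schema."my_table$partitions"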
It does not depend on the format but on the compression algorithm used: mostly Snappy for ORC and GZIP for Parquet. This is what makes the difference.

Redshift sequential scan on columns defined as sort keys

Referring to the attached image of the query plan of the query I'm executing: created_time is an interleaved sort key which is used as a range filter on the data.
Though it looks like a sequential scan is happening on the table data, the rows-scanned column seems empty in the image. Does that mean there is no scan happening and the sort keys work?
Even though created_time is your sort key, in this query it is not recognized since you are converting it to a date. Therefore, it is scanning the entire table.
You need to leave it unmodified for Redshift to know that it is the sort key.
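A hedged sketch of the rewrite (the table name and dates are placeholders, and it assumes the original filter wrapped created_time in a date conversion): compare the raw sort key column against a range instead of applying a function to it, so the sort key and zone maps can be used:
-- instead of WHERE DATE(created_time) = '2021-01-01'
SELECT *
FROM my_table
WHERE created_time >= '2021-01-01'
  AND created_time < '2021-01-02';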

Reading (even joining) a very large (1.1bn row) table in Enterprise Guide from Teradata

Hopefully you guys can help with what I'm hoping is quite a simple question for those in the know!
I live (well, work) in SAS Enterprise Guide and am trying to perform a simple left join against a table in Teradata.
The table is extremely large (700+ columns, 1.1bn rows) and so far I have been connecting via a LIBNAME statement at the top of my program, followed by the usual PROC SQL to read the data.
The issue I am having is that it is extremely slow. I performed the join successfully using 90 rows on the left table and it took 3 hours to complete. The real table I want to use has something like 15,000 rows.
I have tried to connect via the SQL Pass-Through method, but this throws a hosts file error, which I can't fix due to corporate security limitations.
Has anyone had any experience performing this kind of task?
I should mention that a simple select * query in Teradata SQL Assistant runs in just over 1 minute (16,666,666 obs/s!), so the limitation must be somewhere between SAS and Teradata, or even in SAS itself.
I'm sorry I haven't posted actual code snippets as they're on my work machine but this has been bugging me for ages so thought I'd see if I'm missing any tricks.
Thanks in advance for your help.
So you're joining a SAS data set to a Teradata table and want to return the matching records. You'll want to use SAS's DBMASTER= data set option. It designates which of the tables is larger. By telling SAS this, it knows which table to move.
Here I assume librefs have already been assigned and that the Teradata table is larger--more obs--than the SAS data set.
proc sql threads;
  select tdTable.*
  from sastables.sasTable1, td.tdTable(dbmaster=yes)
  where tdTable.idNum=sasTable1.idNum;
quit;
If by chance your SAS data set is larger, you'll want to use the MULTI_DATASRC_OPT= option. Either google these terms or look in the SAS/Access to Relational Databases manual. It's pretty good.
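A minimal sketch of that second case, reusing the same hypothetical librefs and column names (the Teradata connection details are placeholders): MULTI_DATASRC_OPT=IN_CLAUSE asks SAS/ACCESS to build an IN clause from the unique join-key values of the smaller data source so that only the matching rows are fetched from Teradata.
/* LIBNAME option; server and credentials are placeholders */
libname td teradata server=tdserv user=myuser password=XXXXXXXX multi_datasrc_opt=in_clause;
proc sql;
  select tdTable.*
  from sastables.sasTable1, td.tdTable
  where tdTable.idNum=sasTable1.idNum;
quit;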
Good luck.
Have you considered creating a volatile table in Teradata? Since this is created in your spool allocation you shouldn't need explicit permissions to create the table. Once created you can load the SAS data set into the Volatile table and collect statistics on the table's join columns and filter columns. This will help the optimizer understand the demographics about your "small" table. The volatile table will only persist for the duration of your session and is not accessible across multiple sessions.
Then rewrite your SAS code to push-down the SQL to Teradata joining the large table to your volatile table. The results can be returned to SAS and loaded into another data set.
CREATE VOLATILE TABLE MyTable, NO FALLBACK
( ColA SMALLINT NOT NULL,
ColB VARCHAR(10) NOT NULL
) PRIMARY INDEX (ColA)
ON COMMIT PRESERVE ROWS /* This is important */
;
The primary index is how Teradata distributes the data and accesses the data. Tables distributed on the same column will join "AMP local" and will not require a redistribution. This is not always possible, as your primary index selection has to consider even distribution as well as access path. The primary index does not have to be unique, but can be.
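A hedged sketch of the follow-up steps once the SAS data has been loaded into the volatile table (the large table name is a placeholder; the other names reuse the example above): collect statistics on the join column, then push the join down to Teradata so only the matching rows come back to SAS.
COLLECT STATISTICS ON MyTable COLUMN (ColA);
SELECT big.*
FROM my_large_table AS big
JOIN MyTable AS v
  ON big.ColA = v.ColA;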
Hope this helps.

Power Query Formula Language - Detect type of columns

In Power BI, I've got some query tables generated from imported data. All the data comes in as type 'Any', and I'm trying to automatically detect the type of the data in each column.
Some of the queries generate tables with columns based on the incoming data - I don't know what the columns are going to be until the query runs and sets up the table (the data comes from an Azure blob). As I will have quite a few tables to maintain, whose columns can change (possibly with new columns being added) on any data refresh, it would be unmanageable to go through all of them each time and press 'Detect Data Type' on the columns.
So I'm trying to figure out how I can do a 'Detect Data Type' in the query formula language to attach to the end of the query that generates the table columns. I've tried grabbing the first entry in a column and doing Value.Type(column{0}); however, this seems to come out as 'Text' for a column which has integers in it. Pressing 'Detect Data Type' does, however, correctly identify the type as 'Whole Number'.
Does anyone know how to detect a column's entry types?
P.S. I'm not too worried about a column possibly holding values of different data types
You seem to have multiple issues here, and your solution will be fragile; there's a better way. But let's first deal with column type detection. Power Query uses the 'any' data type as its go-to data type. You can write a function that samples the rows of a column in a table, does a best-match data type detection, and then explicitly sets the data type of the column. This is probably messy and tricky since you need to do it once per column. This might be workable for a fixed schema, but for a dynamic schema you'll run into a couple of things very quickly. First, you'll need to write some crazy PQ code to list all the columns and run your function on each. This will work the first time, but might break in subsequent refreshes because data model changes are not allowed during refresh. If you're using a tool like Power BI Desktop, you'll be able to fix things up. If you publish your report to the Power BI service, you'll just see refresh errors.
Dynamic Schemas will suffer the same data model change issue I mentioned above.
The alternative solution that you won't have problems with is using a DirectQuery data source instead of Power Query. If you load your data into Azure SQL or a Tabular Model, the reporting layer will pick up the updated fields automatically, so you don't have to try to work around it using PQ.