Do views of tables in BigQuery benefit from partitioning/clustering optimization?

We have a few tables in BigQuery that are updated nightly, and a deduplication process then slowly performs garbage collection on them.
To ensure that our UI always shows the latest data, we have a view set up for each table that simply does a SELECT WHERE on the newest timestamp/record_id combination.
We're about to set up partitioning and clustering to optimize query scope/speed, and I couldn't find a clear answer in the Google documentation on whether queries against the view of that table will still be partitioned, or whether they will end up scanning all the data.
Alternatively, when we create the view, can we include the partitioning and clustering definitions in the query that builds the view?

If you're talking about a logical view, then yes: if the base table it references is clustered/partitioned, queries through the view will use those features when they're referenced from the WHERE clause. The logical view doesn't have its own managed storage; it's effectively a SQL subquery that gets run whenever the view is referenced.
If you're talking about a materialized view, then partitioning/clustering from the base table isn't inherited, but it can be defined on the materialized view. See the DDL syntax for more details: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_materialized_view_statement
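As a minimal sketch of the logical-view case (all table, view, and column names here are hypothetical): because BigQuery inlines a logical view's SQL into the outer query, a filter on the partition column still prunes partitions:

CREATE TABLE `project.dataset.events`
(
  record_id STRING,
  updated_at TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(updated_at)
CLUSTER BY record_id;

-- The view keeps only the newest row per record_id; it has no storage of its own.
CREATE VIEW `project.dataset.events_latest` AS
SELECT *
FROM `project.dataset.events`
QUALIFY ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY updated_at DESC) = 1;

-- A query through the view that filters on the partition column scans only
-- the matching partitions, and clustering on record_id narrows the blocks read:
SELECT record_id, payload
FROM `project.dataset.events_latest`
WHERE DATE(updated_at) >= '2023-01-01'
  AND record_id = 'abc-123';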

Related

How are user-defined aggregations different from having a table aggregated via SQL/Power Query in import mode?

I'm exploring the concept of aggregations in Power BI.
I understand that the automatic aggregations feature is only available when the corresponding table is in DirectQuery mode, whereas manual aggregations are supported for import mode as well as DirectQuery mode.
Effectively, we can write an aggregate GROUP BY SQL query to fetch the data and load it into an aggregate table in import mode.
I'm unable to understand how this helps, because since the aggregate table is in import mode, the source database table will need to be queried at each data refresh. So what problem do aggregations solve?
"the source database table will need to be queried at each data refresh."
But the source database table will not need to be queried for each visual render. Visuals render for each user on load and on any change to another visual (like selecting in a slicer, or cross-highlighting in a chart). In pure DirectQuery, every visual render sends a DAX query, which translates to one or more SQL queries.
SQL queries over large fact tables can require many seconds of CPU time in the source, whereas the same DAX can read an in-memory aggregation table in a few milliseconds.
So the problem it solves is reducing the number of queries sent to the source by report users. Typically you create aggregates that satisfy the initial views of reports and dashboards, plus the most common slicers and filters used by the reports.
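As an illustration, the aggregate GROUP BY query loaded into an import-mode aggregation table might look like this (table and column names are hypothetical):

SELECT
  d.calendar_date,
  p.product_category,
  SUM(f.sales_amount) AS total_sales,
  COUNT(*) AS order_count
FROM fact_sales AS f
JOIN dim_date AS d ON f.date_key = d.date_key
JOIN dim_product AS p ON f.product_key = p.product_key
GROUP BY d.calendar_date, p.product_category;

This hits the source once per scheduled refresh; any visual whose DAX can be answered at this grain then reads the in-memory table instead of sending SQL to the source.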

Extra table is created after creating a Redshift materialized view

Created a Redshift materialized view (view name: lirt_cases_mv) that queries an external schema. However, an extra table named mv_tbl__lirt_cases_mv__0 is also created. Does anyone know why this extra table is created? Is there a way to prevent it from being created?
Thank you for your help.
I found an article that attempts to explain how materialized view refreshes work under the hood in Redshift. Here is what the article suggests happens on a full refresh:
1. A stored procedure called mv_sp__house_price_mvw__0_0 is invoked.
2. As part of the procedure, a backup table called mv_tbl__house_price_mvw__0__tmp is created using the materialized view query.
3. A view called house_price_mvw is created/replaced using the _tmp table.
4. The table mv_tbl__house_price_mvw__0 is dropped.
5. The _tmp table is renamed to mv_tbl__house_price_mvw__0.
6. The house_price_mvw view is created/replaced based on mv_tbl__house_price_mvw__0.
If this is correct, then mv_tbl__lirt_cases_mv__0 is the source object from which your materialized view lirt_cases_mv is created/replaced, and I don't think there is any way around having it.
I haven't verified that everything the author says in the article is true, but you can reproduce and verify it yourself by querying svl_statementtext after performing a full refresh of the materialized view.
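For example, a sketch of that check (substitute your own materialized view name):

REFRESH MATERIALIZED VIEW lirt_cases_mv;

-- Inspect the statements Redshift logged for the refresh:
SELECT starttime, sequence, text
FROM svl_statementtext
WHERE text LIKE '%lirt_cases_mv%'
ORDER BY starttime, sequence;

If the article is right, the logged statements should include the _tmp table creation, the drop, and the rename described above.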

BigQuery create view based on future tables using wildcard expression

Using Terraform, I'm trying to create a view that is responsible for consolidating multiple system tables into a single 'master' system table, e.g.,
system_system1
system_system2
...
The following query is used to create the view:
SELECT * EXCEPT(non_shared_cols) FROM `project.dataset.system_*`
This works as expected, and downstream components can use the master table to compute metrics. However, most of the tables do not exist at creation time of the view, hence I'm getting the following error:
project:dataset.system_* does not match any table.
I assumed the view would be resolved at query time, but apparently this is not the case. Is there any other BigQuery concept I could rely on to create this view?
Or is this just some kind of safety check that I can avoid somehow?
I could of course create a 'dummy' table in Terraform, but this seems really tedious, as I need to know the shared schema of the BigQuery tables in advance.
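For reference, the dummy-table workaround would look roughly like this in plain SQL (the placeholder name and columns are hypothetical, and it does require knowing the shared schema up front):

-- Placeholder so that the wildcard matches at least one table:
CREATE TABLE `project.dataset.system_placeholder`
(
  shared_col_a STRING,
  shared_col_b INT64,
  non_shared_col STRING
);

CREATE VIEW `project.dataset.system_master` AS
SELECT * EXCEPT(non_shared_col)
FROM `project.dataset.system_*`;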

Create a QuickSight dataset from a PostgreSQL materialized view

The AWS QuickSight Documentation mentions that:
You can retrieve data from tables and materialized views in PostgreSQL instances, and from tables in all other database instances.
When creating a dataset from my PostgreSQL 9.5 database, none of my materialized views appear in the list of tables to select from.
Is the documentation incorrect? Is there somewhere else I should be selecting from?
I haven't used views as my source. However, I can usually see tables from only one schema. Maybe your views are in a different schema?
If that is not the case, just use Query instead of Table as the source for your dataset, and select * from myview.
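If the schema is the issue, one way to check where your materialized views actually live (PostgreSQL 9.3+) is to query the pg_matviews catalog view:

SELECT schemaname, matviewname
FROM pg_matviews;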

How can I create a model with ActiveRecord capabilities but without an actual table behind it?

I think this is a recurring question on the Internet, but unfortunately I'm still unable to find a satisfactory answer.
I'm using Ruby on Rails 4, and I would like to create a model that is backed by a SQL query rather than an actual table in the database. For example, let's suppose I have two tables in my database: Questions and Answers. I want to build a report that contains statistics on both tables. For that purpose, I have a complex SQL statement that takes data from these tables to build up the statistics. However, the SELECT used in the SQL statement does not take values directly from either the Answers or the Questions table, but from nested SELECTs.
So far I've been able to create the StatItem model without any migration, but when I try StatItem.find_by_sql("...nested selects...") the system complains about a nonexistent table stat_items in the database.
How can I create a model whose instances' data is retrieved from a complex query rather than from a table? If that's not possible, I could create a temporary table to store the data. In that case, how can I tell the migration file not to create such a table (it would be created by the query)?
How about creating a materialized view from your complex query and following this tutorial:
ActiveRecord + PostgreSQL Materialized Views
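As a minimal sketch of that approach, assuming PostgreSQL and hypothetical statistics columns (the query body stands in for your nested SELECTs):

CREATE MATERIALIZED VIEW stat_items AS
SELECT q.id,
       q.title,
       (SELECT COUNT(*) FROM answers a WHERE a.question_id = q.id) AS answer_count
FROM questions q;

-- Re-run whenever the underlying data changes:
REFRESH MATERIALIZED VIEW stat_items;

A StatItem model can then read from stat_items as if it were an ordinary table.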
Michael Kohl's proposal of materialized views gave me an idea, which I had initially discarded because I wrongly thought that a single database connection could be shared by two processes; after reading about how Rails processes requests, I think my solution is fine.
STEP 1 - Create the model without migration
rails g model StatItem --migration=false
STEP 2 - Create a temporary table called stat_items
#First, drop any existing table created by older requests (database connections are kept open by the server process(es))
ActiveRecord::Base.connection.execute('DROP TABLE IF EXISTS stat_items')
#Second, create the temporary table with the desired columns (notice: a dummy integer column called 'id' should exist in the table)
ActiveRecord::Base.connection.execute('CREATE TEMP TABLE stat_items (id integer, ...)')
STEP 3 - Execute an SQL statement that inserts rows in stat_items
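For example (a sketch only; the id value and the statistics columns are hypothetical stand-ins for the real nested SELECTs):

INSERT INTO stat_items (id, question_count, answer_count)
SELECT 1,
       (SELECT COUNT(*) FROM questions),
       (SELECT COUNT(*) FROM answers);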
STEP 4 - Access the table using the model, as usual
For example:
StatItem.find_by_...
Any comments/improvements are highly appreciated.