Using Terraform, I'm trying to create a view that is responsible for consolidating multiple system tables into a single 'master' system table, e.g.,
system_system1
system_system2
...
The following query is used to create the view:
SELECT * except(non_shared_cols) FROM `project.dataset.system_*`
This works as expected, and downstream components can use the master table to compute metrics. However, most of the tables do not exist at creation time of the view, hence I'm getting the following error:
project:dataset.system_* does not match any table.
I assumed the view would be resolved at query time, but apparently this is not the case. Is there any other BigQuery concept I could rely on to create this view?
Or is this just some kind of a safety check which I can avoid somehow?
I could of course create a 'dummy' table in Terraform, but this seems really tedious, as I need to know the shared schema of the BigQuery tables in advance.
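For illustration, this is roughly the workaround I have in mind, sketched with the BigQuery Python client instead of Terraform (project, dataset and schema are just examples); a single placeholder table matching the prefix would be enough for the wildcard to resolve:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical placeholder sharing the common schema, created only so that
# `project.dataset.system_*` matches at least one table.
placeholder = bigquery.Table(
    "my-project.my_dataset.system_placeholder",
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ],
)
client.create_table(placeholder, exists_ok=True)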
Related
I have a GCP Dataflow pipeline configured with a SELECT SQL query that selects specific rows from a Postgres table and then inserts these rows automatically into the BigQuery dataset. This pipeline is configured to run daily at 12 AM UTC.
When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows again into the BigQuery table, hence resulting in data duplication.
I wanted to know if there is a way to truncate the BigQuery table before the pipeline runs. It seems like a common problem, so I'm hoping there's an easy solution that doesn't involve a custom Dataflow template.
BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.
From the link above, WRITE_TRUNCATE means:
Specifies that write should replace a table.
The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
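The linked WriteDisposition is on the Java SDK's BigQueryIO.Write; the Python SDK exposes the same setting as write_disposition on WriteToBigQuery. A minimal sketch (table name and schema are made up), assuming you build the pipeline with the Beam SDK yourself rather than using the stock template:

import apache_beam as beam

# Hypothetical pipeline: beam.Create stands in for the JDBC read. Each daily
# run replaces the table contents instead of appending, so no duplicates.
with beam.Pipeline() as p:
    (
        p
        | "ReadRows" >> beam.Create([{"id": 1, "name": "example"}])
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,name:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )

Note that WRITE_TRUNCATE applies to batch loads; it isn't supported for streaming inserts.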
If your use case cannot afford the table being unavailable during the operation, a common pattern is to move the data to a secondary / staging table first, and then use atomic operations on BigQuery to replace the original table (e.g., using CREATE OR REPLACE TABLE).
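For example (all names are hypothetical), with the pipeline writing into a staging table, the swap itself is a single statement that can be run via the BigQuery client:

from google.cloud import bigquery

client = bigquery.Client()

# The Dataflow job loads into staging_table (e.g. with WRITE_TRUNCATE); this
# then replaces the table the UI reads from in one atomic step.
client.query("""
    CREATE OR REPLACE TABLE `my-project.my_dataset.final_table` AS
    SELECT * FROM `my-project.my_dataset.staging_table`
""").result()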
Let's suppose I have a table in BigQuery and I create a dataset in Vertex AI based on it. I train my model. A while later, the data gets updated several times in BigQuery.
But can I simply go to my model and get redirected to the exact version of the data it was trained on?
Using time travel, I can still access the historical data in BigQuery. But I haven't managed to go to my model, figure out which version of the data it was trained on, and look at that data.
On the Vertex AI page for creating a dataset from BigQuery there is this statement:
The selected BigQuery table will be associated with your dataset. Making changes to the referenced BigQuery table will affect the dataset before training.
So there is no copy or clone of the table prepared automatically for you.
Another fact is that usually you don't need the whole base table to create the dataset; you probably subselect based on date or other WHERE conditions. Essentially the point here is that you filter your base table, and your new dataset is only a subselect of it.
The recommended way is to create a dataset where you will store your table sources; let's call it vertex_ai_dataset. In this dataset you will store all the tables that are part of a Vertex AI dataset. Make sure to version them, and not update them.
So BASETABLE -> SELECT -> WRITE AS vertex_ai_dataset.dataset_for_model_v1 (use the latter in Vertex AI).
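A sketch of that write step (project, column names and the filter are hypothetical), so every training run gets its own frozen source table:

from google.cloud import bigquery

client = bigquery.Client()

# Freeze a filtered copy of the base table as an immutable, versioned source
# for a Vertex AI dataset.
client.query("""
    CREATE TABLE `my-project.vertex_ai_dataset.dataset_for_model_v1` AS
    SELECT *
    FROM `my-project.my_dataset.basetable`
    WHERE event_date BETWEEN '2023-01-01' AND '2023-06-30'
""").result()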
Another option is that whenever you issue a TRAIN action, you also SNAPSHOT the base table. But be aware this needs to be maintained and cleaned up as well.
CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname
CLONE dataset.basetable;
Other parameters and a guide are here.
You could also automate this by observing the Vertex AI train event (it should be documented here), and use Eventarc to start a Cloud Workflow that will automatically create a BigQuery table snapshot for you.
We have a few tables in BigQuery that are being updated nightly, and then we have a deduplication process doing garbage collection slowly.
To ensure that our UI is always showing the latest data, we have a view set up for each table that simply does a SELECT with a WHERE on the newest timestamp / record_id combination.
We're about to set up partitioning and clustering to optimize query scope/speed, and I couldn't find a clear answer in Google's documentation on whether queries against the view of that table will still benefit from partitioning, or whether they will end up scanning all the data.
Alternatively, when we create the view, can we include the PARTITION BY and CLUSTER BY in the query that builds the view?
If you're talking about a logical view, then yes: if the base table it references is clustered/partitioned, queries will use those features as long as the relevant columns are referenced in the WHERE clause. The logical view doesn't have its own managed storage; it's effectively just a SQL subquery that gets run whenever the view is referenced.
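A small illustration with made-up names, assuming base_table is partitioned on event_date and clustered on record_id (a simplified stand-in for your dedup view):

from google.cloud import bigquery

client = bigquery.Client()

# Logical view: no storage of its own, just a saved query.
client.query("""
    CREATE OR REPLACE VIEW `my-project.my_dataset.active_records` AS
    SELECT record_id, event_date, updated_at, payload
    FROM `my-project.my_dataset.base_table`
    WHERE NOT is_deleted
""").result()

# The literal filter on the partition column still prunes partitions of the
# base table, and the record_id filter benefits from clustering.
client.query("""
    SELECT *
    FROM `my-project.my_dataset.active_records`
    WHERE event_date >= DATE '2024-01-01'
      AND record_id = 'abc'
""").result()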
If you're talking about a materialized view, then partitioning/clustering from the base table isn't inherited, but can be defined on the materialized view. See the DDL syntax for more details: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_materialized_view_statement
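For example, a materialized view over the same hypothetical base_table; the PARTITION BY / CLUSTER BY are declared in the materialized view's own DDL (the partitioning column must come from the base table's partitioning column):

from google.cloud import bigquery

client = bigquery.Client()

# Materialized view with its own partitioning and clustering.
client.query("""
    CREATE MATERIALIZED VIEW `my-project.my_dataset.daily_counts_mv`
    PARTITION BY event_date
    CLUSTER BY record_id
    AS
    SELECT event_date, record_id, COUNT(*) AS row_count
    FROM `my-project.my_dataset.base_table`
    GROUP BY event_date, record_id
""").result()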
I am deploying Athena external tables and want to update their definitions without downtime. Is there a way?
The ways I thought about are:
Create a new table, rename the old one, and then rename the new one to the old name. First, this involves a very small amount of downtime, and second, renaming tables doesn't seem to be supported (nor does altering the definition).
The other way is to drop the table and recreate it, which obviously involves downtime.
If you use the Glue UpdateTable API call you can change a table definition without first dropping the table. Athena uses the Glue Catalog APIs behind the scenes when you do things like CREATE TABLE … and DROP TABLE ….
Please note that if you make changes to partitioned tables you also need to update all partitions to match.
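A sketch with boto3 (database, table and column names are made up). TableInput only accepts a subset of the fields that GetTable returns, so copy the parts you want to keep and modify what's needed:

import boto3

glue = boto3.client("glue")

current = glue.get_table(DatabaseName="analytics", Name="events")["Table"]

# TableInput must not include read-only fields such as DatabaseName or
# CreateTime, so rebuild it from the pieces we care about.
table_input = {
    "Name": current["Name"],
    "TableType": current.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": current.get("Parameters", {}),
    "PartitionKeys": current.get("PartitionKeys", []),
    "StorageDescriptor": current["StorageDescriptor"],
}

# Example change: append a new column to the schema.
table_input["StorageDescriptor"]["Columns"].append(
    {"Name": "new_col", "Type": "string"}
)

glue.update_table(DatabaseName="analytics", TableInput=table_input)

# For partitioned tables, remember to bring existing partitions in line too
# (e.g. via update_partition / batch_update_partition).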
Another way that does not involve using Glue directly would be to create a view for each table and only use these views in queries. When you need to replace a table you create a new table with the new schema, then recreate the view (with CREATE OR REPLACE) to use the new table, then drop the old table. I haven't checked, but it would make sense if replacing a view used the UpdateTable API behind the scenes.
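A sketch of that view-swap pattern, driving Athena from boto3 (database name and S3 output location are hypothetical; real code should poll get_query_execution until each statement finishes):

import boto3

athena = boto3.client("athena")

def run(sql):
    # Fire-and-forget for brevity.
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

# events_v2 has already been created with the new schema. Queries only ever
# reference the view, so repointing it is effectively downtime-free.
run("CREATE OR REPLACE VIEW events AS SELECT * FROM events_v2")
run("DROP TABLE IF EXISTS events_v1")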
The AWS Redshift team recommends using TRUNCATE in order to clean up a large table.
I have a continuous EC2 service that keeps adding rows to a table. I would like to apply some purging mechanism, so that when the cluster is near full it will auto delete old rows (say using the index column).
Is there some best practice for doing that?
Do I need to write my own code to handle that? (If so, is there already a Python script for this that I can use, e.g., in a Lambda function?)
A common practice when dealing with continuous data is to create a separate table for each month, e.g. sales_2018_01, sales_2018_02.
Then create a VIEW that combines the tables:
CREATE VIEW sales AS
SELECT * FROM sales_2018_01
UNION ALL
SELECT * FROM sales_2018_02
Then, create a new table each month and remove the oldest month from the View. This effectively gives a 12-month rolling view of the data.
The benefit is that data does not have to be deleted from tables (which would then require a VACUUM). Instead, the old table can simply be dropped, or kept around for historical reporting with a different View.
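A sketch of the monthly rotation (connection details and table names are hypothetical), e.g. run from a scheduled Lambda function:

import os
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password=os.environ["REDSHIFT_PASSWORD"],
)
cur = conn.cursor()

# Add the newest month, repoint the view, then drop the oldest table.
# Dropping a whole table frees its space immediately; no DELETE or VACUUM.
cur.execute("CREATE TABLE IF NOT EXISTS sales_2018_03 (LIKE sales_2018_02);")
cur.execute("""
    CREATE OR REPLACE VIEW sales AS
    SELECT * FROM sales_2018_02
    UNION ALL
    SELECT * FROM sales_2018_03;
""")
cur.execute("DROP TABLE IF EXISTS sales_2018_01;")
conn.commit()
conn.close()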
See: Using Time Series Tables - Amazon Redshift