Is it possible to retrieve all records from only one BigQuery table cluster? - google-cloud-platform

BigQuery's documentation says the following about clustered tables:
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Since the records in the table are already colocated and sorted, is it possible to retrieve all the records from an arbitrary cluster?
I have a table that is too large to use ORDER BY. However, it is already clustered in the manner I need, so I can save a lot of time and expense if I could retrieve all the data from each cluster separately.

Related

With SQL or Python how can I find out if a table is part of a sharded set of tables (in BigQuery)?

I want to find out what my table sizes are (in BigQuery).
However I want to sum up the size of of all tables that belong to a specific set of sharded tables.
So I need to find metadata that shows that a table is part of a set of sharded tables.
So I can do: How to get BigQuery storage size for a single table
select
sum(size_bytes)/pow(2, 30) as size_gb
from
<your_dataset>.__TABLES__
But here I can't see if the table is part of a set of sharded set of tables.
This is what my Google Analytics sharded tables look like in BQ:
So somewhere must be metadata that indicates that tables with for example name ga_sessions_20220504 belong to a sharded set ga_sesssions_
Where/how can I find that metadata?
I think you are exploring the right query, most of the time, I use the following query to drill down on shards & it's sizes
SELECT
project_id,
dataset_id,
table_id,
array_reverse(SPLIT(table_id, '_'))[OFFSET(0)] AS shard_pt,
DATE(TIMESTAMP_MILLIS(creation_time)) creation_dt,
ROUND(size_bytes/POW(1024, 3), 2) size_in_gb
FROM
`<project>.<dataset>.__TABLES__`
WHERE
table_id LIKE 'ga_sessions_%'
ORDER BY
4 DESC
Result (on some random GA dataset I have access to FYI)
There is no metadata on Sharded tables via SQL.
Tables being displayed as Sharded in BigQuery UI happens when you do the following ->
Create 2 or more tables that have the following characteristics:
exist in the same dataset
have the exact same table schema
the same prefix
have a suffix of the form _YYYYMMDD (eg. 20210130)
These are something of a legacy feature, they were more commonly used with bigquery’s legacy SQL.
This blog was very insightful on this:
https://mark-mccracken.medium.com/bigquery-date-sharding-vs-date-partitioning-cee3754f7900

how to create partition and cluster on an existing table in big query?

In SQL Server , we can create index like this. How do we create the index after the table already exists? What is the syntax of create clusted index in bigquery?
CREATE INDEX abcd ON `abcd.xxx.xxx`(columnname )
In big query, we can create table like below. But how to create partition and cluster on an existing table?
CREATE TABLE rep_sales.orders_tmp PARTITION BY DATE(created_at) CLUSTER BY created_at AS SELECT * FROM rep_sales.orders
As #Sergey Geron mentioned in the comments, BigQuery doesn’t support indexes. For more information, please refer to this doc.
An existing table cannot be partitioned but you can create a new partitioned table and then load the data into it from the unpartitioned table.
As for clustering of tables, BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table. This method of updating the clustering column set is useful for tables that use continuous streaming inserts because those tables cannot be easily swapped by other methods.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Note: When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward. For example, a non-clustered 1 PB table that is converted to a clustered table using tables.update still has 1 PB of non-clustered data. Automatic re-clustering only applies to any new data committed to the table after the update.

BigQuery - querying only a subset of keys in a table with key value schema

So I have a table with the following schema:
timestamp: TIMESTAMP
key: STRING
value: FLOAT
There are around 200 unique keys. I am partitioning the dataset by date.
I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.
The issue is that because of this key - value format, and BigQuery being a columnar database, each query queries the whole day's data, despite each query actually using a maximum of 4 keys. What is a best way to optimize this?
I am thinking the best way I can go about it right now is to create separate temp tables for each key as a daily batch process, run my queries on them and then delete them.
Ideal way I would want to go about it is partitioning by key, I am not sure there is any such provision?
You can try using recently introduced clustering partitioned tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Update (moved from comments)
Also have in mind below
Feature Partitioning Clustering
--------------- ------------- -------------
Cardinality Less than 10k Unlimited
Dry Run Pricing Available Not available
Query Pricing Exact Best Effort
Pay special attention to Dry Run Pricing - unfortunately - clustered tables do not support dry run (validation) based on clustered keys - and rather show only validation based on partitions. but if you set your clustering properly - actual run will end up with lower cost. you should try with smaller data to get comfortable with this
See more at Clustering partitioned tables

Columnar database queries in Amazon Redshift

I'm learning Amazon Redshift. Heard that it is very powerful storage on cloud and works very fast on data where aggregate operations are required because it stores data column-wise.
Am not able to find any example queries? Could someone share with me some examples of Aggregate queries running on Amazon Redshift? Is it different from normal relation database queries?
You are correct -- Amazon Redshift is a columnar database. This means that data is stored on disk per column, making operations on a column very fast. For example, adding the Sales column for a particular value in the Country column only requires accessing two columns rather than all columns in a table.
Other benefits are that data in Redshift is compressed (which works well with the columnar concept, because each column uses its own compression method based on the data stored) and the fact that it is a clustered database, so compute and storage can be scaled by adding additional nodes.
Amazon Redshift presents itself as a PostgreSQL database, so you just use industry-standard SQL to query data. No changes to queries are required.
However, you can optimize Redshift by wisely choosing a Distribution Key for each table that determines how data is distributed amongst nodes, and carefully select the Sort Key, which determines how data is stored on each node. Put simply, data should be distributed by how you JOIN tables and should be sorted by what you use in WHERE statements.
As for sample queries... it totally depends upon your data! Queries look exactly the same as normal SQL.

Azure SQL Data Warehouse CTAS statistics

Does the "Create table as" function in SQL Data Warehouse create statistics in the background, or do they have to manually be created (as I would when I do a normal "Create table" statement?)
As of the current version, you always have to create column-level statistics on tables, irrespective of whether it was created with a normal CREATE TABLE or the CTAS CREATE TABLE AS... command. It's also good practice to create stats for columns used in JOINs, WHERE clauses, GROUP BY, ORDER BY and DISTINCT clauses.
Regarding tables created with CTAS, the database engine has a correct idea of how many rows are in the table as listed in sys.partitions, but not at the column-level statistics level. For tables created by CREATE TABLE this defaults to 1,000 rows. For the example below, the first table was created with a CTAS and has 208 rows, the second table with an ordinary CREATE TABLE and INSERT from the first table and also has 208 rows, but sys.partitions believes it to have 1,000 eg
Creating any column-level statistics manually will correct this number.
In summary, always manually create statistics against important columns irrespective of how the table was created.