Showing BigQuery sharded tables as a group on BigQuery Web UI - google-cloud-platform

When I look at the BigQuery tables auto-created as part of the Google Analytics to BigQuery data export, the tables that are created on a daily basis are grouped as below.
I also have a set of manually created tables that have a common prefix.
e.g.
test_table_123
test_table_235
etc..
Currently these tables are shown as:
I would like these tables to be shown as test_table_(2) instead.
How can I achieve this?

You are referring to _TABLE_SUFFIX:
Tables can be named in almost any format, <string><suffix>, and _TABLE_SUFFIX can then be used to query multiple tables at once.
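For example, a minimal sketch of a wildcard query over your test_table_ prefix (the dataset name here is made up):
-- Hypothetical example: query every table matching the prefix test_table_;
-- _TABLE_SUFFIX holds whatever follows the prefix (e.g. '123', '235').
SELECT
  _TABLE_SUFFIX AS shard,
  COUNT(*) AS row_count
FROM
  `my_dataset.test_table_*`
GROUP BY
  shard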
I couldn't find documentation about how this is represented in the BigQuery web UI, but the only format that gets collapsed the way you want (table_name_(<number_of_tables>)) is a date suffix in the format table_name_yyyymmdd; other table suffixes are not collapsed by the web UI.
Ways to go:
Use a table name with the format table_name_yyyymmdd if it makes sense for your application (a sketch follows this list)
or
Open a feature request on issue-tracker.
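As a sketch of the first option (the dataset and table names are made up), you could copy each existing table to a date-suffixed name so the UI groups them:
-- Hypothetical: recreate an integer-suffixed table under a yyyymmdd-suffixed name.
CREATE TABLE `my_dataset.test_table_20240101` AS
SELECT * FROM `my_dataset.test_table_123`;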
Examples of my tests:
one table with an integer suffix and another with a date suffix:
The date suffix compresses the tables and creates a filter in the UI:
The integer suffix doesn't compress the tables or create a filter:

Related

With SQL or Python how can I find out if a table is part of a sharded set of tables (in BigQuery)?

I want to find out what my table sizes are (in BigQuery).
However I want to sum up the size of all tables that belong to a specific set of sharded tables.
So I need to find metadata that shows that a table is part of a set of sharded tables.
So I can do: How to get BigQuery storage size for a single table
select
  sum(size_bytes)/pow(2, 30) as size_gb
from
  <your_dataset>.__TABLES__
But here I can't see whether a table is part of a sharded set of tables.
This is what my Google Analytics sharded tables look like in BQ:
So somewhere there must be metadata indicating that a table named, for example, ga_sessions_20220504 belongs to the sharded set ga_sessions_
Where/how can I find that metadata?
I think you are exploring the right query. Most of the time, I use the following query to drill down on shards and their sizes:
SELECT
  project_id,
  dataset_id,
  table_id,
  ARRAY_REVERSE(SPLIT(table_id, '_'))[OFFSET(0)] AS shard_pt,
  DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_dt,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_in_gb
FROM
  `<project>.<dataset>.__TABLES__`
WHERE
  table_id LIKE 'ga_sessions_%'
ORDER BY
  4 DESC
Result (on some random GA dataset I have access to FYI)
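Building on that query, a sketch that sums the size per sharded set, which is what the question is ultimately after (it assumes every shard suffix is an 8-digit date):
-- Hypothetical sketch: strip a trailing 8-digit date suffix to derive the shard set name,
-- then aggregate table count and total size per set.
SELECT
  REGEXP_REPLACE(table_id, r'\d{8}$', '') AS shard_set,
  COUNT(*) AS table_count,
  ROUND(SUM(size_bytes) / POW(1024, 3), 2) AS total_size_gb
FROM
  `<project>.<dataset>.__TABLES__`
GROUP BY
  shard_set
ORDER BY
  total_size_gb DESC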
There is no metadata on sharded tables available via SQL.
Tables are displayed as sharded in the BigQuery UI when you do the following (see the sketch after this list):
Create 2 or more tables that:
exist in the same dataset
have the exact same table schema
share the same prefix
have a suffix of the form _YYYYMMDD (e.g. _20210130)
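A minimal sketch of such a pair (the dataset, table name, and schema are made up):
-- Hypothetical: these two tables share a dataset, schema, prefix, and a _YYYYMMDD
-- suffix, so the UI collapses them into a single entry named my_table_(2).
CREATE TABLE `my_dataset.my_table_20210129` (id INT64, name STRING);
CREATE TABLE `my_dataset.my_table_20210130` (id INT64, name STRING);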
These are something of a legacy feature; they were more commonly used with BigQuery's legacy SQL.
This blog was very insightful on this:
https://mark-mccracken.medium.com/bigquery-date-sharding-vs-date-partitioning-cee3754f7900

Perform data mapping in GCP

I have data coming from multiple hotels. These hotels are not using the same naming convention for storing the order information. I have a predefined dataset created in BigQuery (called hotel_order). I want to map the data coming from the different hotels to this single dataset in GCP, so it is easier for me to do comparisons in BigQuery.
If a column name (from hotel1) matches a column name in the BigQuery dataset, then BigQuery should load the data into that column; if the column names (from the hotel orders data and the dataset in BigQuery) don't match, then the column in BigQuery should have a null value. How do I implement this in GCP?
If you want to join tables together, and show a null value when a match doesn't exist, then you can do so using 'left join'.
Rough example
select main.*, Hotel_One.*
from hotel.orders as main left join hotel_number_one as Hotel_One on main.order_information = Hotel_One.order_information
It is difficult to give a more detailed answer without more information or a working example using dbfiddle.

How does Amazon Athena manage renaming of columns?

Hi everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column gets renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script that iterates through all my files on S3, renames the columns, and updates the table schema in Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
I recommend reading more about handling schema updates with Athena here. Generally, Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, with Parquet, columns are read by name, but you can change that to reading by index. Each way has its own advantages and disadvantages when dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
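As a rough sketch of the by-index option (the database, table, bucket path, and the exact placement of the property are assumptions based on my reading of the Athena schema-update docs):
-- Hypothetical sketch: an Athena table over Parquet that maps columns by position
-- rather than by name, so renaming col_b to col_beta does not orphan old files.
-- Assumption: the parquet.column.index.access property is set via TBLPROPERTIES.
CREATE EXTERNAL TABLE my_db.orders_by_index (
  col_a STRING,
  col_beta STRING,
  col_c STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/exports/orders/'
TBLPROPERTIES ('parquet.column.index.access' = 'true');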
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
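For instance, a hypothetical view that papers over the rename (all names here are made up):
-- Hypothetical sketch: the superset table carries both the old and the new column;
-- the view exposes a single logical column regardless of which one a given file populated.
CREATE VIEW my_db.orders_v AS
SELECT
  col_a,
  COALESCE(col_beta, col_b) AS col_beta,
  col_c
FROM my_db.orders_superset;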
You can set the AWS Glue crawler to run 'On Demand' or on a 'Time Based' schedule, so every time your data on S3 updates, a new schema will be generated (you can edit the data types of the attributes in the schema). This way your columns will stay updated and you can query the new field.
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names to map data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

Is it possible to query log data stored in Cloud Storage without cleaning it, using BigQuery?

I have a huge amount of log data exported from StackDriver to Google Cloud Storage. I am trying to run queries using BigQuery.
However, while creating the table in the BigQuery dataset I am getting:
Invalid field name "k8s-app".
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Table: bq_table
A huge amount of log data is exported from StackDriver sinks, and it contains a large number of unique column names. Some of these names aren't valid column names for BigQuery tables.
What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?
Note: I do not want to load (put) my data into BigQuery storage, just query the data that is present in Google Cloud Storage.
* EDIT *
Please refer to this documentation for a clearer understanding
I think you can go down one of these routes, based on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:
bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
B. Load Raw Data
If the above is not applicable, then simply load the data in its unstructured, raw format into BigQuery. Once your data is in there, you can do all sorts of things with it.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset into the table, providing the appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, you can process it using SQL queries: split it based on your delimiter and skip the stuff you don't like.
C. Create External Table
If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:
bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable
Querying Data
If you picked option A and it works for you, you can simply keep querying your data the way you already were.
If you picked B or C, your table now contains each row of your dataset as a single column. You can split these single-column rows into multiple columns, based on your delimiter requirements.
Let's say your rows should have 3 columns named a, b, and c:
a1,b1,c1
a2,b2,c2
Right now it's all in a single column named data, which you can split on the delimiter ,:
select
  splitted[safe_offset(0)] as a,
  splitted[safe_offset(1)] as b,
  splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
Hope it helps.
To expand on @khan's answer:
If the files are JSON, then you won't be able to use the first method (skip headers).
But you can load each JSON row raw into BigQuery - as if it were a CSV - and then parse it in BigQuery
Find a full example for loading rows raw at:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And then you can use JSON_EXTRACT_SCALAR to parse JSON in BigQuery - and transform the existing field names into BigQuery compatible ones.
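A rough sketch of that (the field paths here are made up; the single data column comes from the raw-load approach above):
-- Hypothetical sketch: pull fields out of raw JSON rows and expose the
-- problematic "k8s-app" key under a BigQuery-compatible name.
SELECT
  JSON_EXTRACT_SCALAR(data, '$.severity')   AS severity,
  JSON_EXTRACT_SCALAR(data, "$['k8s-app']") AS k8s_app
FROM
  `mydataset.rawlogstable`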
Unfortunately no!
As part of log analytics, it is common to reshape the log data and run a few ETLs before the files are committed to a persistent sink such as BigQuery.
If performance monitoring is all you need for log analytics, and there is no rationale for creating additional ETL code, all metrics can be derived from the REST API endpoints of Stackdriver Monitoring.
If you do not need the fields whose names contain -, you can use ignore_unknown_values. You provide the schema you want, and with ignore_unknown_values any field not matching that schema will be ignored.

BigQuery table: multiple partitions per day

I am trying to load some Avro-format data into BigQuery through the API, and I need some partitioning. According to the documentation here
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TimePartitioning
It will create only one partition per day with ingestion-time partitioning, which uses the _PARTITIONTIME column. Is it possible to create multiple partitions per day by using a timestamp field?
Another option I can think of is the range partitioning documented here
https://cloud.google.com/bigquery/docs/reference/rest/v2/JobConfiguration#RangePartitioning
however, it is marked as experimental, and I'm not sure whether it's good for production use.