How to set a custom partition expiration in BigQuery - google-cloud-platform

I have a dataset for which the Default table expiration is 7 days. I want only one of the tables within this dataset to never expire.
I found the following bq command: bq update --time_partitioning_expiration 0 --time_partitioning_type DAY project-name:dataset-name.table_name
The problem is that my tables have a partitioning suffix, so they're named like this: REF_PRICE_20210921, REF_PRICE_20210922, etc., so the table name per se is REF_PRICE_.
I can't seem to apply the bq command to this partitioned table, as I get the error BigQuery error in update operation: Not found: Table project-name:dataset-name.REF_PRICE_, but it does exist. What am I doing/understanding wrong?
EDIT: My tables are not "partitioned" but sharded; they are wildcard tables, and so separate tables. Apparently it is not possible to set an expiration for those tables unless it is done on each one individually.

Have you tried suffixing the table name with *, like REF_PRICE_*?
Moreover, you should read this post, because you might have created sharded tables when you wanted partitioned ones.
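Following up on the edit above: since sharded tables are separate tables, one workaround is to loop over the shards and clear the expiration on each of them with bq update --expiration 0. A minimal bash sketch, assuming the bq and jq CLIs are available and that the shards share the REF_PRICE_ prefix (the project, dataset, and jq path are placeholders to adapt; adjust the jq path if your bq version formats the JSON listing differently):
# Sketch: remove the table expiration on every shard with the REF_PRICE_ prefix,
# so each shard never expires.
PROJECT="project-name"
DATASET="dataset-name"
PREFIX="REF_PRICE_"

for table in $(bq ls --max_results=10000 --format=json "${PROJECT}:${DATASET}" \
                 | jq -r '.[].tableReference.tableId' \
                 | grep "^${PREFIX}"); do
  # --expiration 0 clears the expiration on the table.
  bq update --expiration 0 "${PROJECT}:${DATASET}.${table}"
done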

Related

With SQL or Python how can I find out if a table is part of a sharded set of tables (in BigQuery)?

I want to find out what my table sizes are (in BigQuery).
However, I want to sum up the sizes of all tables that belong to a specific set of sharded tables.
So I need to find metadata that shows that a table is part of a set of sharded tables.
I can already get the size of a single table (How to get BigQuery storage size for a single table):
select
sum(size_bytes)/pow(2, 30) as size_gb
from
<your_dataset>.__TABLES__
But here I can't see whether a table is part of a sharded set of tables.
This is what my Google Analytics sharded tables look like in BQ:
So somewhere there must be metadata indicating that tables with, for example, the name ga_sessions_20220504 belong to a sharded set ga_sessions_.
Where/how can I find that metadata?
I think you are exploring the right query. Most of the time, I use the following query to drill down on shards and their sizes:
SELECT
project_id,
dataset_id,
table_id,
array_reverse(SPLIT(table_id, '_'))[OFFSET(0)] AS shard_pt,
DATE(TIMESTAMP_MILLIS(creation_time)) creation_dt,
ROUND(size_bytes/POW(1024, 3), 2) size_in_gb
FROM
`<project>.<dataset>.__TABLES__`
WHERE
table_id LIKE 'ga_sessions_%'
ORDER BY
4 DESC
Result (on some random GA dataset I have access to FYI)
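If the goal is the total size per sharded set rather than per shard, one option is to strip the trailing _YYYYMMDD suffix and group on what remains. A sketch along those lines, wrapped in the bq CLI (the project/dataset placeholders and the suffix regex are assumptions to adapt):
# Sketch: total size per sharded set, grouping by the table name with the
# trailing _YYYYMMDD suffix removed. Replace <project> and <dataset>.
bq query --use_legacy_sql=false '
SELECT
  REGEXP_REPLACE(table_id, r"_\d{8}$", "") AS shard_set,
  COUNT(*) AS shard_count,
  ROUND(SUM(size_bytes) / POW(1024, 3), 2) AS size_gb
FROM
  `<project>.<dataset>.__TABLES__`
GROUP BY
  shard_set
ORDER BY
  size_gb DESC'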
There is no explicit metadata on sharded tables available via SQL.
Tables are displayed as sharded in the BigQuery UI when you create 2 or more tables that have the following characteristics:
exist in the same dataset
have the exact same table schema
have the same prefix
have a suffix of the form _YYYYMMDD (e.g. _20210130)
These are something of a legacy feature; they were more commonly used with BigQuery's legacy SQL.
This blog post was very insightful on the topic:
https://mark-mccracken.medium.com/bigquery-date-sharding-vs-date-partitioning-cee3754f7900

How to fetch the latest schema change in BigQuery and restore a deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the query below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (TTL), according to the official documentation.
If I then run the query above, the column is still there in the schema as well as in the Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS table.
It is just that I cannot insert data into such a column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and perform some operations based on them.
I need more visibility into the BigQuery table schema. The least I need is:
Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS can store a flag column that indicates deleted or TTL: 7 days.
My questions are:
How can I fetch the correct schema in Spanner which reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch logs, though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob (since an ALTER TABLE is considered an InsertJob) and filters them based on the actual query text where it says drop. The regex I used is not strict (for the sake of the example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the command above (column PADDING was dropped by the query):
If you have options other than Bash, I suggest creating a BQ sink for your logs so that you can run queries there and get this information. You can also use client libraries (Python, NodeJS, etc.) to either query the sink or query GCP Logging directly.
As per this SO answer, you can use the time travel feature of BQ to query the deleted column. The answer also explains BQ's behavior of retaining the deleted column for 7 days and a workaround to delete the column instantly. See the actual query used to retrieve the deleted column and the workaround for deleting a column at the previously provided link.
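As a concrete illustration of the time travel approach, a query along these lines can read the dropped column from a snapshot taken before the ALTER TABLE, as long as you are within the 7-day window (the dataset, table, and column names come from the question, and the one-hour offset is a placeholder):
# Sketch: read the dropped column from a point-in-time snapshot of the table.
# Adjust the names and the offset so the timestamp falls before the DROP COLUMN.
bq query --use_legacy_sql=false '
SELECT
  blabla
FROM
  `Dataset.User`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)'
From there the snapshot can be written back out (for example with a CREATE TABLE ... AS SELECT over that query) to restore the data from the dropped column.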

Fetch scheduled data from a BigQuery table to another BigQuery table (scheduled queries)

I am really new to GCP and I am trying to write a query in GCP BigQuery that fetches all the data from one BigQuery table and inserts it into another BigQuery table.
I am trying the following query, where Project1.DataSet1.Table1 is the table I am trying to read the data from, and Project2.Dataset2.Table2 is the table where I am trying to insert all the data, with the same naming:
SELECT * FROM `Project1.DataSet1.Table1` LIMIT 1000
insert INTO `Project2.Dataset2.Table2`
But I am receiving a query error message.
Does anyone know how to solve this issue?
There may be a couple of comments here.
The syntax should be different: insert into ... select ... and so on; see DML statements in standard SQL.
Also, this approach to copying data might not be very optimal in terms of time and cost. It might be better to use bq cp -f ... commands, if that is possible in your case; see BigQuery Copy — How to copy data efficiently between BigQuery environments and the bq command-line tool reference.
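For the copy-based route, a minimal sketch with the bq CLI, using the project/dataset/table names from the question (the -f flag overwrites the destination table if it already exists):
# Sketch: copy the table across projects with bq cp instead of running a query.
bq cp -f Project1:DataSet1.Table1 Project2:Dataset2.Table2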
The correct syntax of the query is as suggested by @al-dann. I will try to explain further with a sample query below:
Query:
insert into `Project2.Dataset2.Table2`
select * from `Project1.DataSet1.Table1`
Input Table:
This will insert values into the second table as below:
Output Table:

Does hierarchical partitioning work in AWS Athena/S3?

I am new to AWS and trying to use S3 and Athena for a use case.
I want the data saved as JSON files in S3 to be queried from Athena. To reduce the data scanned, I have created a directory structure like this:
../customerid/date/*.json (format)
../100/2020-04-29/*.json
../100/2020-04-30/*.json
.
.
../101/2020-04-29/*.json
In Athena, the table structure has been created according to the data we are expecting, and 2 partition columns have been created, namely customer (customerid) and dt (date).
I want to query all the data for customer '100' and limit the scan to its directory, for which I am trying to load the partition as follows:
alter table <table_name> add
partition (customer=100) location 's3://<location>/100/'
But I get the following error
FAILED: SemanticException partition spec {customer=100} doesn't contain all (2) partition columns
Clearly it's not loading a single partition when multiple partition columns have been defined.
Giving both partitions in the alter table statement:
alter table <table_name> add
partition (customer=100, dt=2020-04-22) location 's3://<location>/100/2020-04-22/'
I get this error
missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;
Am I doing something wrong?
Does this even work?
If not, is there a way to work with hierarchical partitions?
The issue is with the S3 hierarchical structure that you have used. It should be
../customer=100/dt=2020-04-29/*.json
instead of
../100/2020-04-29/*.json
If you have data in S3 stored in the correct prefix structure as mentioned, then you can add the partitions with a simple MSCK REPAIR TABLE <table_name> command, as sketched below.
Hope this clarifies things.
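If you prefer to kick off the repair from the command line rather than the Athena console, a sketch with the AWS CLI (the table name, workgroup, and results bucket are placeholders; a workgroup with a preconfigured output location would not need the last flag):
# Sketch: with objects laid out as s3://<bucket>/customer=100/dt=2020-04-29/*.json,
# a single repair registers all customer/dt partitions.
aws athena start-query-execution \
  --query-string "MSCK REPAIR TABLE <table_name>" \
  --work-group "primary" \
  --result-configuration "OutputLocation=s3://<query-results-bucket>/"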
I figured out the mistake I was making, so I wanted to share it in case anyone finds themselves in the same situation.
For data not partitioned in Hive format (refer to this for Hive and non-Hive formats):
Taking the example above, the following is the alter command that works:
alter table <table_name> add
partition (customer=100, dt=date '2020-04-22') location 's3://<location>/100/2020-04-22/'
Notice the change in the syntax of the "dt" partition. My partition data type was set to "date", and not specifying it while loading the partition was giving the error.
Not giving the data type also works; we just need to use single quotes, which default the partition value to a string/varchar:
alter table <table_name> add
partition (customer=100, dt='2020-04-22') location 's3://<location>/100/2020-04-22/'
I prefer specifying the date data type when adding the partition, as that is how I configured my partition.
Hope this helps.

AWS Athena - Rename column name

I am trying to change a column name in an AWS Athena table, from old_name to new_name.
Normal DDL commands do not affect the table (they cannot be executed).
Is it possible to change a column name without deleting and re-creating the table from scratch?
I was mistaken; Athena uses Hive DDL syntax, so the correct command is (with the column's data type, string in this case, given at the end):
ALTER TABLE %%table-name%% CHANGE %%old-column-name%% %%new-column-name%% string;
I based my answer on a Hive-related question.
You can find more about supported and unsupported DDL statements here.
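As a concrete illustration, and assuming the Hive-style CHANGE syntax above is accepted by your Athena engine version, the statement can also be submitted through the AWS CLI (the table name, column names, workgroup, and results bucket below are placeholders):
# Sketch: rename old_name to new_name, keeping the column's existing type (string here).
aws athena start-query-execution \
  --query-string "ALTER TABLE my_table CHANGE old_name new_name string" \
  --work-group "primary" \
  --result-configuration "OutputLocation=s3://<query-results-bucket>/"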