Cannot export a date-partitioned table from BigQuery to a GCS bucket - google-cloud-platform

I'm trying to export a day-partitioned table to a GCS bucket.
My main goal is to create daily files in the bucket.
Here is my simple first test for the extraction:
bq extract --destination_format=CSV test.test_table$20210719 gs://test_partition/part_col=20210716/test_*.csv
test_table is a partitioned table, and I have 7k rows for the date 2021-07-19.
However, when I run this command, I get the following error:
BigQuery error in extract operation: Error processing job 'test-net:bqjob_1234456789: Not found: Table test-
net:temp.test_partitiontime0210719 was not found in location EU
As you can see above, the table name in the error is temp.test_partitiontime0210719; the 2 at the start of the date suffix is missing. Whenever the $ sign is included in the table name, the first character after it is removed.
I've tried to follow this GCP documentation, and I have also tried this comment, but neither has worked for me.
How can I extract a specific date partition of the table to the GCS bucket?

It's a shell side effect. If you write $something, the Linux shell looks for a variable named something and substitutes its value; here bash expands $2 (an empty positional parameter) and leaves 0210719 behind.
To avoid that, wrap the argument in single quotes ' to tell the shell: "Hey, do not expand this token", so do this:
bq extract --destination_format=CSV 'test.test_table$20210719' gs://test_partition/part_col=20210716/test_*.csv
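You can see the expansion with a plain echo (a quick illustration; bash treats the single digit after $ as a positional parameter, which is empty here):
echo test.test_table$20210719      # prints test.test_table0210719, because $2 expands to nothing
echo 'test.test_table$20210719'    # prints test.test_table$20210719, kept literally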

Related

How to fetch the latest schema change in BigQuery and restore a deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the query below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (the TTL window), based on the official documentation.
If I then run the query above, the column is still there in the schema as well as in Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and run some operations based on them, so I need more visibility into BigQuery table schemas. The least I need is:
a flag column in Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS that indicates whether a column is deleted or still within its 7-day TTL
My questions are:
How can I fetch the correct schema in Spanner that reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to find the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch the logs, though it will be difficult to parse the output and extract the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob (an ALTER TABLE statement is treated as an InsertJob) and filters them on the actual query text containing drop. The regex I used is loose (for the sake of the example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the command above (the column PADDING was dropped, based on the query):
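If you stay in Bash, one way to make the output easier to handle is to project only the query text instead of the full log entry. This is just a sketch: the field path mirrors the filter above, and --limit and --freshness are standard gcloud flags, so adjust them to your needs.
gcloud logging read \
  'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"' \
  --freshness=7d --limit=20 \
  --format='value(protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query)'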
If you have options other than Bash, I suggest creating a BigQuery sink for your logs; you can then run queries there and get this information. You can also use client libraries (Python, Node.js, etc.) to either query the sink or query Cloud Logging directly.
As per this SO answer, you can use BigQuery's time travel feature to query the deleted column. That answer also explains BigQuery's behavior of retaining a deleted column for 7 days and a workaround for deleting the column instantly. See the linked answer for the actual query used to retrieve the deleted column and for the workaround.
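For illustration, a time-travel query along those lines might look like this (the dataset, table, and column names are placeholders, and the timestamp must point to a moment before the DROP COLUMN ran, within the 7-day window):
-- adjust the interval so the snapshot predates the DROP COLUMN
SELECT blabla
FROM `mydataset.User`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);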

How to set a custom partition expiration in BigQuery

I have a dataset for which the Default table expiration is 7 days. I want only one of the tables within this dataset to never expire.
I found the following bq command: bq update --time_partitioning_expiration 0 --time_partitioning_type DAY project-name:dataset-name.table_name
The problem is that my tables have a date suffix, so they are named like this:
REF_PRICE_20210921, REF_PRICE_20210922, etc., so the base table name per se is REF_PRICE_.
I can't seem to apply the bq command to this table: I get the error BigQuery error in update operation: Not found: Table project-name:dataset-name.REF_PRICE_, yet the table does exist. What am I doing/understanding wrong?
EDIT: My tables are not "partitioned" but sharded; they are wildcard tables, and so separate tables. Apparently it is not possible to set an expiration for those unless it's done on each one individually.
Have you tried suffixing the table name with *, like REF_PRICE_*?
Moreover, you should read this post, because you might have created sharded tables when you wanted a partitioned one.
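If they really are sharded tables, one workaround is to loop over the shards and update each one individually. This is only a sketch: the dataset name and listing limit are placeholders, and bq update --expiration 0 is intended to remove the table-level expiration.
for t in $(bq ls -n 10000 dataset-name | awk '{print $1}' | grep '^REF_PRICE_'); do
  bq update --expiration 0 "dataset-name.$t"   # 0 removes the expiration for this shard
done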

What file metadata is available in AWS Athena?

AWS Athena allows the user to display the underlying file from which a row is read, like so:
select timestamp, "$path" from table
I'm after a comprehensive list of other columns similar to $path. In particular I'm looking for a way to query the total size of those files (not the data scanned by the query, but the total size of the files).
AWS released Athena engine version 3 last month, and I can see that it supports the hidden columns $file_size and $file_modified_time, but I couldn't find them advertised in the documentation.
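Assuming $file_size behaves like $path, a query for the total size of the underlying files could look roughly like this (the table name is a placeholder; the DISTINCT subquery avoids counting a file once per row):
SELECT SUM(file_size) AS total_bytes
FROM (
  SELECT DISTINCT "$path" AS file_path, "$file_size" AS file_size
  FROM my_table
);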

How do I execute the SHOW PARTITIONS command on an Athena table?

I'm using AWS Athena with AWS Glue for the first time, with S3 providing a 'folder' structure which maps to the partitions in my data - I'm getting into the concepts so please excuse any mistaken description!
I'm looking at what happens when I add data to my S3 bucket and see that new folders are ignored. Digging deeper, I came across the 'SHOW PARTITIONS' command, described here: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html. I'm trying to execute it against my test tables using the Athena query editor, with a mind to then use the 'ALTER TABLE ADD PARTITION' command to add the new S3 folders.
I'm trying to execute the 'SHOW PARTITIONS' command in the AWS Athena Console Query Editor:
SHOW PARTITIONS "froms3_customer-files"."unit"
but when I try to execute it I see this message:
line 1:17: missing {'from', 'in'} at '"froms3_customer-files"' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: c0c0c351-2d42-4da4-b1f3-223b1733db65)
I'm struggling to understand what this is telling me, can anyone help me here?
Athena does not support hyphens in database names:
Athena table, view, database, and column names cannot contain special characters, other than underscore (_).
Also, remove the double quotes from the SHOW PARTITIONS command:
SHOW PARTITIONS froms3_customer_files.unit
References:
Athena table and database naming convention
Athena show partitions
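Once SHOW PARTITIONS works, the ALTER TABLE ADD PARTITION step mentioned in the question would look roughly like this (the partition key dt and the S3 location are hypothetical; use your table's actual partition column and folder path):
ALTER TABLE froms3_customer_files.unit
ADD IF NOT EXISTS PARTITION (dt = '2021-01-01')
LOCATION 's3://your-bucket/unit/dt=2021-01-01/';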
If you want to see all the partitions that have been created so far, you can use the following command:
SHOW PARTITIONS DB_NAME.TABLE_NAME
If you want to view the keys by which the table is partitioned, you can see them in the UI as follows:
1. Click on the table menu options.
2. Click on Show Properties.
3. Click on Partitions to see the partition keys.

Moving a partitioned table across regions (from US to EU)

I'm trying to move a partitioned table from the US to the EU region, but whenever I manage to do so, it doesn't partition the table on the correct column.
The current process I'm following is:
Create a Storage bucket in the region that I want the partitioned table to be in
Export the partitioned table as CSV to the original bucket (within the old region)
Transfer the table across buckets (from the original bucket to the new one)
Create a new table using the CSV from the new bucket (auto-detect schema is on)
bq --location=eu load --autodetect --source_format=CSV table_test_set.test_table [project ID/test_table]
I expect the table to be partitioned on the DATE column, but instead it's partitioned on the PARTITIONTIME column.
Also note that I'm currently doing this with CLI commands. This will need to be redone multiple times, so having reusable code is a must.
When I migrate data from one table to another, I follow this process:
I extract the data to GCS (CSV or another format)
I extract the schema of the source table with this command: bq show --schema <dataset>.<table>
I create the destination table via the GUI using "edit as text" for the schema and paste it in. I manually define the partition field that I want to use from that schema;
I load the data from GCS into the destination table.
This process has 2 advantages:
When you import a CSV, you define the REAL types that you want. Remember, with schema auto-detect, BigQuery looks at only about 10 or 20 lines and deduces the schema. Often a string field is set to INTEGER because the first lines of the file contain no letters, only numbers (serial numbers, for example).
You can define your partition field properly.
The process is quite easy to script. I use the GUI for creating the destination table, but the bq command line is great for doing the same thing.
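A minimal scripted sketch of that flow (the dataset, table, bucket, and partition column names are placeholders; the load step assumes the exported CSV keeps its header row):
# export from the source region, copy across buckets, then recreate and load in the EU
bq extract --destination_format=CSV source_dataset.my_table 'gs://my-us-bucket/my_table_*.csv'
gsutil -m cp 'gs://my-us-bucket/my_table_*.csv' gs://my-eu-bucket/
bq show --schema --format=prettyjson source_dataset.my_table > schema.json
bq mk --table --schema=schema.json --time_partitioning_field=date --time_partitioning_type=DAY eu_dataset.my_table
bq --location=eu load --source_format=CSV --skip_leading_rows=1 --schema=schema.json eu_dataset.my_table 'gs://my-eu-bucket/my_table_*.csv'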
After some more digging I managed to find the solution: by using --time_partitioning_field [column name] you can partition by a specific column. So the command would look like this:
bq --location=eu --schema [where your JSON schema file is] load --time_partitioning_field [column name] --source_format=NEWLINE_DELIMITED_JSON table_test_set.test_table [project ID/test_table]
I also found that using JSON schema files makes things easier.
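For reference, the schema JSON produced by bq show --schema (and accepted by bq mk and bq load) is just an array of field definitions; a minimal example with hypothetical columns:
[
  {"name": "date", "type": "DATE", "mode": "NULLABLE"},
  {"name": "price", "type": "NUMERIC", "mode": "NULLABLE"}
]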