Generating dynamic backup tables script bigquery - google-cloud-platform

I have a task to create backups (copies) of selected tables from various datasets, either from one project into another or within the same project. A model query is listed below. There are about 300 tables in total.
CREATE OR REPLACE TABLE $target_project.$target_dataset.$table_name_$suffix AS
SELECT * FROM $source_project.$source_dataset.$table_name
To accomplish this, I have created a config table with the dataset name and table name. I have two approaches in mind:
Option 1 -
Create a SQL script that loops through all the records in the config table, generates dynamic SQL, and executes it. I am unable to find a proper way to loop through the table records and read their values into variables. The struct command only takes one query at a time.
Option 2 -
Create a SQL file containing all the CREATE TABLE statements, with placeholders for the project name and suffix.
Use a DAG to pass the variables into the SQL file and execute the script using Cloud Composer.
Option 2 works and is feasible. It is just not scalable, owing to the fact that any change would require modifying the script, re-uploading it, and so on.
Can someone help me with Option 1, or advise if there is a better way to accomplish this task?
We work with the GCP suite of products and are happy to use various tools to accomplish this.
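One way to picture Option 1 is a small Python sketch that turns config rows into the statements from the model query. The row values, project names, and the helper name build_backup_statements are hypothetical; in practice the rows would come from querying the config table, and each generated string would still need to be submitted to BigQuery (for example from a Composer task), which is not shown here.

```python
from typing import Iterable, List, Tuple


def build_backup_statements(
    rows: Iterable[Tuple[str, str]],
    source_project: str,
    target_project: str,
    target_dataset: str,
    suffix: str,
) -> List[str]:
    """Return one CREATE OR REPLACE TABLE statement per (dataset, table) row."""
    statements = []
    for source_dataset, table_name in rows:
        target = f"`{target_project}.{target_dataset}.{table_name}_{suffix}`"
        source = f"`{source_project}.{source_dataset}.{table_name}`"
        statements.append(
            f"CREATE OR REPLACE TABLE {target} AS\nSELECT * FROM {source}"
        )
    return statements


# Hypothetical rows standing in for the contents of the config table.
config_rows = [("sales", "orders"), ("sales", "customers"), ("hr", "employees")]

statements = build_backup_statements(
    config_rows,
    source_project="src-project",   # assumed project id
    target_project="bkp-project",   # assumed project id
    target_dataset="backups",       # assumed dataset name
    suffix="20240101",
)

for sql in statements:
    print(sql + ";")
```

Because the table list lives in the config table, adding or removing a backup only means changing a row, not editing and re-uploading a script. BigQuery scripting also offers a FOR ... IN loop and EXECUTE IMMEDIATE, which may allow the same loop to run inside a single SQL script.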

Related

Truncate existing BigQuery table before DataFlow job runs

I have a GCP Dataflow pipeline configured with a SELECT query that picks specific rows from a Postgres table and inserts them automatically into a BigQuery dataset. This pipeline is configured to run daily at 12am UTC.
When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows into the BigQuery table again, resulting in duplicated data.
I wanted to know if there is a way to truncate the BigQuery table before the pipeline runs. This seems like a common problem, so I am looking for an easy solution that does not require a custom Dataflow template.
BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.
From the link above, WRITE_TRUNCATE means:
Specifies that write should replace a table.
The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
If your use case cannot afford the table being unavailable during the operation, a common pattern is to load the data into a secondary (staging) table and then use an atomic operation in BigQuery to replace the original table (e.g., CREATE OR REPLACE TABLE).
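The staging pattern described above can be sketched as a tiny Python helper that builds the replacement statement. The table names and the helper name build_swap_statement are hypothetical, and the pipeline is assumed to have finished writing to the staging table before the statement is run.

```python
def build_swap_statement(target_table: str, staging_table: str) -> str:
    """Build one statement that replaces target_table with the staging contents."""
    return (
        f"CREATE OR REPLACE TABLE `{target_table}` AS\n"
        f"SELECT * FROM `{staging_table}`"
    )


swap_sql = build_swap_statement(
    "my-project.reporting.users",          # hypothetical target table
    "my-project.reporting_staging.users",  # hypothetical staging table
)
print(swap_sql)
```

Readers of the target table keep seeing the previous contents until this single statement completes, which is the point of routing the load through a staging table.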

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch columns and data type of BQ tables via the below command:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (TTL), according to the official documentation.
If I use the above command, the column is still there in the schema as well as in Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into the column or view it in the GCP console. This inconsistency causes real problems.
I want to write a bash script that monitors schema changes and performs some operations based on them,
so I need more visibility into the BigQuery table schema. The least I need is
for Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to store a flag column that indicates whether a column is deleted or within its 7-day TTL.
My questions are:
How can I fetch the correct schema in Spanner, one that reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch the logs, though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob, since an ALTER TABLE is treated as an InsertJob, and filters them on the actual query text where it says drop. The regex I used is not strict (for the sake of example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the output of the command above (column PADDING is deleted by the query):
If you have options other than Bash, I suggest creating a BigQuery sink for your logs so that you can query them there and get this information. You can also use client libraries (Python, Node.js, etc.) to query either the sink or Cloud Logging directly.
As per this SO answer, you can use the time travel feature of BigQuery to query the deleted column. The answer also explains why BigQuery retains the deleted column for 7 days and gives a workaround for deleting the column immediately. See that answer for the actual query used to retrieve the deleted column and for the workaround.
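As an illustration of the stricter matching suggested for the log filter, here is a minimal Python sketch that extracts the table and column from a query string once the log entries have been fetched. The pattern, the helper name parse_drop_column, and the sample queries are assumptions made for this example; it only handles statements that drop a single column.

```python
import re
from typing import Optional, Tuple

# Matches: ALTER TABLE [IF EXISTS] <table> DROP COLUMN [IF EXISTS] <column>[;]
DROP_COLUMN_RE = re.compile(
    r"^\s*ALTER\s+TABLE\s+(?:IF\s+EXISTS\s+)?`?(?P<table>[\w.\-]+)`?\s+"
    r"DROP\s+COLUMN\s+(?:IF\s+EXISTS\s+)?`?(?P<column>\w+)`?\s*;?\s*$",
    re.IGNORECASE,
)


def parse_drop_column(query: str) -> Optional[Tuple[str, str]]:
    """Return (table, column) for a single-column drop, or None otherwise."""
    match = DROP_COLUMN_RE.match(query)
    if match is None:
        return None
    return match.group("table"), match.group("column")


# Hypothetical query strings, as they might appear in the job logs.
queries = [
    "Alter TABLE User drop column blabla",
    "ALTER TABLE `my_dataset.User` DROP COLUMN IF EXISTS PADDING;",
    "SELECT 'alter table x drop y'",
]
results = [parse_drop_column(q) for q in queries]
print(results)
```

Anchoring the pattern at both ends keeps queries that merely mention "alter table" and "drop" inside a string literal from being reported as schema changes.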

Move entire dataset from one google project to another google project without data

As part of code deployment to production, we need to copy all tables from a BigQuery dataset to the production environment. However, both the UI option and the bq command-line option move the data too. How do I move all the BigQuery tables at once from the non-prod to the prod environment without the data?
Kindly suggest.
Posting my comment as an answer:
I don't know of any way to achieve what you want directly, but there is a possible workaround:
You first need to create the dataset in the destination project and then run CREATE TABLE new_project.dataset.xx AS SELECT * FROM old_project.dataset.xx WHERE 1=0.
You also need to make sure to specify the partition field. This works well for datasets with just a few tables; for larger datasets you can script the operation in Python or whatever else you use.
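Scripting the workaround in Python could look like the sketch below, which only generates the statements. The table names, partition expressions, project and dataset ids, and the helper name build_empty_copy_statements are hypothetical; the destination dataset is assumed to exist already, and running the statements against BigQuery is left out.

```python
from typing import Dict, List, Optional


def build_empty_copy_statements(
    tables: Dict[str, Optional[str]],
    source: str,
    target: str,
) -> List[str]:
    """Build CREATE TABLE ... AS SELECT ... WHERE 1=0 for each table.

    tables maps a table name to its partition expression, or None
    when the table is not partitioned.
    """
    statements = []
    for table_name, partition_expr in tables.items():
        partition_clause = (
            f"PARTITION BY {partition_expr}\n" if partition_expr else ""
        )
        statements.append(
            f"CREATE TABLE `{target}.{table_name}`\n"
            f"{partition_clause}"
            f"AS SELECT * FROM `{source}.{table_name}` WHERE 1=0"
        )
    return statements


# Hypothetical tables: one partitioned on a timestamp column, one not.
tables = {"events": "DATE(event_ts)", "countries": None}

empty_copies = build_empty_copy_statements(
    tables,
    source="nonprod-project.analytics",  # assumed project.dataset
    target="prod-project.analytics",     # assumed project.dataset
)
for sql in empty_copies:
    print(sql + ";\n")
```

The WHERE 1=0 predicate returns no rows, so each new table gets the source columns but none of the data.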

Having trouble setting up multiple tables in AWS glue from a single bucket

So, I've used Glue before, but it's been with a single file <> single folder relationship.
What I'm trying to do now is to have a structure like this create individual tables for each folder:
- Data Bucket
  - Table 1 Folder
    - file1.csv
    - file2.csv
  - Table 2 Folder
    - file1.csv
    - file2.csv
...and so on.
But every time I create the crawler and set the Data Bucket as the data source, I only get a single table created. I've tried every combo of the "create single schema ...etc" I can think of.
I'm hoping that I don't have to add each sub-folder as a separate data source as my ultimate goal is to translate it eventually into an RDS instance. Hoping to keep the high-level bucket as the single data source if possible. I can easily tweak folder/file structure if needed.
And yes, I'm aware of partitioning, but isn't that only applicable to individual tables?
Thanks!
I ran into the same issue and, digging into the Glue docs, found that setting the table level in the crawler's output configuration does the trick.
The table level is counted from the bucket level. In your case, I believe setting the table level to 2 (the first folder after the root) would do the trick; 2 means that the table definitions start at that level.
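For anyone setting this outside the console, the table level can also be expressed in the crawler's JSON configuration. The Python sketch below only builds that JSON string; the Grouping / TableLevelConfiguration keys reflect my understanding of the Glue crawler configuration format, the helper name crawler_configuration is hypothetical, and passing the string to the crawler (for example as its Configuration setting) is not shown.

```python
import json


def crawler_configuration(table_level: int) -> str:
    """Return a crawler configuration JSON string with the given table level."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": table_level},
    }
    return json.dumps(config)


# Level 2 is the first folder under the bucket, as described above.
configuration_json = crawler_configuration(2)
print(configuration_json)
```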
I've been trying to accomplish the same thing. I was hoping that Glue would magically see the different folders and automatically create separate tables. Glue seems to want to create a single table, especially when the schemas overlap. In my example, I'm using US census data so there are some common fields, especially in the beginning of each file.
In the end, I was able to get this to work by creating multiple data stores in the Glue Crawler. By doing this, it would create the five separate tables I wanted, but I had to add each folder manually. Still hoping to find a way to get Glue to discover them automatically.

How to run dynamic queries in Informatica cloud mapping task?

I am new to Informatica Cloud. I have a list of queries stored in a table, like below.
Now I want to take the queries from this table one by one, use each as a source query, and load whatever results it returns into the target. All tables have already been created in the source and target.
I just need to copy the data based on the dynamic queries that are kept in one of my SQL tables.
If anyone has any ideas, please share your thoughts with me. It would be a great help.
The source connection will be the connector to your source database and the Source Type will be query. From there it depends how you are managing your variables. See thread on Informatica Network for links to multiple examples.
Read the table as you normally would in Informatica Cloud. Then pass each record into the SQL transformation for execution. Configure where the SQL transformation should execute, and it will run the queries in the database you want.
You can use a SQL task to run dynamic SQL queries.
Link to the SQL task approach: https://www.datastackpros.com/2019/12/informatica-cloud-incremental-load_14.html