BigQuery: DDL statement not executing via client API

I am executing CREATE TABLE IF NOT EXISTS via the client API using the following JobConfigurationQuery:
queryConfig.setUseLegacySql(false)
queryConfig.setFlattenResults(false)
queryConfig.setQuery(query)
Since I am executing a CREATE TABLE DDL statement, I cannot specify a destination table, write disposition, etc. In the Query History section of the web UI, I see the job executing successfully without any exceptions, but no writes happen. Are DDL statements not supported via the client API?
I am using the following client: "com.google.apis" % "google-api-services-bigquery" % "v2-rev397-1.23.0"

From the BigQuery docs, it seems that no error is returned when the table already exists:
The CREATE TABLE IF NOT EXISTS DDL statement creates a table with the
specified options only if the table name does not exist in the
dataset. If the table name exists in the dataset, no error is
returned, and no action is taken.
To answer your question, DDL is supported from the API, which is also stated in the docs. To do this:
Call the jobs.query method and supply the DDL statement in the request
body's query property.
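For completeness, here is a minimal sketch of that call with the same "google-api-services-bigquery" client. It assumes an already-authenticated Bigquery instance (as in your code), and the project ID and table name are placeholders:
import java.io.IOException;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

// Sketch only: submit the DDL through jobs.query instead of jobs.insert with
// a JobConfigurationQuery. The Bigquery client is assumed to be already
// authenticated; the project ID and table name are placeholders.
QueryResponse runDdl(Bigquery bigquery) throws IOException {
    QueryRequest request = new QueryRequest()
        .setUseLegacySql(false)
        .setQuery("CREATE TABLE IF NOT EXISTS mydataset.mytable (id INT64, name STRING)");
    // The returned response carries the job reference that shows up in Query History.
    return bigquery.jobs().query("my-project-id", request).execute();
}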

Related

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the command below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (the TTL), based on the official documentation.
If I run the SELECT above, the column is still there in the schema as well as in the Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS table.
It is just that I cannot insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and do some operations based on them,
so I need more visibility into the table schema of BigQuery. The least I need is for
Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to store a flag column that indicates deleted or TTL: 7 days.
My questions are:
How can I fetch the correct schema in Spanner that reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to find the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch the logs, though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob (an ALTER TABLE is considered an InsertJob) and filters them based on the actual query text where it says drop. The regex I used is not strict (for the sake of example); I suggest making the regex stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Running the command above returns the matching log entries; in my sample output, the column PADDING is the one dropped based on the query.
If you have options other than Bash, I suggest creating a BigQuery sink for your logs; you can then run queries there to get this information. You can also use client libraries like Python, Node.js, etc. to either query the sink or query Cloud Logging directly.
As per this SO answer, you can use the time travel feature of BigQuery to query the deleted column. The answer also explains BigQuery's behavior of retaining the deleted column for 7 days and a workaround to delete the column immediately. See the actual query used to retrieve the deleted column, and the workaround for deleting a column, at the previously provided link.
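Just as an illustration (not the exact query from that answer), a time-travel read with the Java client from the first question could look like this; the project, dataset, table, and column names are placeholders:
import java.io.IOException;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

// Hypothetical sketch: read the dropped column as the table looked one hour
// ago, via BigQuery time travel (FOR SYSTEM_TIME AS OF). Names are placeholders
// and the Bigquery client is assumed to be already authenticated.
QueryResponse readDroppedColumn(Bigquery bigquery) throws IOException {
    String sql =
        "SELECT blabla "
      + "FROM `my-project.Dataset.User` "
      + "FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)";
    return bigquery.jobs()
        .query("my-project", new QueryRequest().setUseLegacySql(false).setQuery(sql))
        .execute();
}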

How to insert records into BigQuery Linked server

I have used the Simba ODBC driver to connect SQL Server to BigQuery as a linked server in SQL Server Management Studio.
I am not able to insert into BigQuery, only able to select data from it. I have checked 'AllowInProcess' and 'NonTransactedUpdate' too.
select * from openquery([GoogleBigQuery], 'select * from first.table2' )
The above select query is working.
Query:
insert into OPENQUERY([GoogleBigQuery], 'select * from first.table2') values (1,'c')
Error generated:
"The OLE DB provider "MSDASQL" for linked server "GoogleBigQuery"
could not INSERT INTO table "[MSDASQL]" because of column "id". The
user did not have permission to write to the column."
Query:
INSERT INTO [GoogleBigQuery].[midyear-byway-252503].[first].[table2] select * from Learning_SQL.dbo.demo
Error generated:
OLE DB provider "MSDASQL" for linked server "GoogleBigQuery" returned message "Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done.".
The OLE DB provider "MSDASQL" for linked server "GoogleBigQuery" could not INSERT INTO table "[GoogleBigQuery].[midyear-byway-252503].[first].[table2]" because of column "id". The user did not have permission to write to the column.
I was wondering if anyone has tried inserting into a BigQuery dataset using a linked server.
This error is due to this limitation. It seems that Microsoft's SQL Server "Linked Servers" option does not support making INSERT, UPDATE, or DELETE calls to the external database being linked to unless the connection supports transactions.
Since BigQuery does not support explicit transactions, MSSQL would not allow INSERT, UPDATE, or DELETE calls to BigQuery.
If you would like to insert data into BigQuery, consider exporting the data into a file and loading that file into BigQuery.
The import file can be in Avro, CSV, JSON (newline delimited only), ORC, or Parquet format.
For more information, refer to importing data into BigQuery.
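If you end up scripting that load against the BigQuery API (rather than using the console or the bq tool), a rough sketch of a load job with the Java client used earlier in this thread could look like this; the Cloud Storage path and the write disposition are assumptions:
import java.io.IOException;
import java.util.Collections;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;

// Sketch only: load an exported CSV from Cloud Storage into the target table.
// The bucket path and WRITE_APPEND disposition are assumptions; the Bigquery
// client is assumed to be already authenticated.
Job loadCsv(Bigquery bigquery) throws IOException {
    Job loadJob = new Job().setConfiguration(new JobConfiguration()
        .setLoad(new JobConfigurationLoad()
            .setSourceUris(Collections.singletonList("gs://my-bucket/table2.csv"))
            .setSourceFormat("CSV")
            .setWriteDisposition("WRITE_APPEND")
            .setDestinationTable(new TableReference()
                .setProjectId("midyear-byway-252503")
                .setDatasetId("first")
                .setTableId("table2"))));
    return bigquery.jobs().insert("midyear-byway-252503", loadJob).execute();
}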

How to fix 'Request contains an invalid argument' error when scheduling queries in BigQuery UI

I'm setting up a scheduled query in the new BigQuery UI as the project owner and have enabled the data transfer API. The query itself is a very simple SELECT * FROM table query written in standard SQL. The datasets I'm using are in the same region.
No matter how I set up the schedule options (start now, schedule start time, daily, weekly, etc.) or the destination dataset/table, I always get the same error:
"Error updating scheduled query: Request contains an invalid argument."
I have no idea which argument is invalid; it gives no more detail than that.
How do I solve this problem?
Trying to schedule the query in the classic BigQuery UI shows a more descriptive error, which illustrates the issue:
Error in creating a new transfer: BigQuery Data Transfer Service does not yet support location northamerica-northeast1.
The data must be stored in either the US or the EU at this time, it seems.

How to execute schema(database) rename in Athena?

I am trying to execute a SQL statement against Athena using SQL Workbench. I have executed several queries, so I know I have a connection, if that is the first question. What would be the solution for renaming a database in Athena, or maybe in Athena through JDBC?
alter schema geoosm rename to geo_osm
An error occurred when executing the SQL command: alter schema
geoosm rename to geo_osm
[Simba]AthenaJDBC An error has been thrown from the AWS
Athena client. line 1:24: mismatched input 'rename' expecting 'SET'
[Execution ID not available] [SQL State=HY000, DB Errorcode=100071] 1
statement failed.
Execution time: 0.27s
My SQL syntax for Athena comes from the Presto documentation, which from my understanding is the syntax used by Athena.
8.1. ALTER SCHEMA Synopsis
ALTER SCHEMA name RENAME TO new_name
Sorry, but there is no way to rename a database in AWS Athena. Fortunately, table data and table definitions are two completely different things in Athena.
You can just create a new database with the right name, generate the DDL for each of your tables, and execute it against the new database.
The "new" tables in the new database will still point to the same location, so there is nothing to worry about.

WSO2 DAS spark script

I'm trying to deploy a new data publisher CAR. I looked at the APIM_LAST_ACCESS_TIME_SCRIPT.xml Spark script (used by API Manager) and didn't understand the difference between the two temporary tables created: API_LAST_ACCESS_TIME_SUMMARY_FINAL and APILastAccessSummaryData.
The two Spark temporary tables represent different JDBC tables (possibly in different datasources), where one of them acts as the source for Spark and the other acts as the destination.
To illustrate this better, have a look at the simplified script in question:
create temporary table APILastAccessSummaryData using CarbonJDBC options (dataSource "WSO2AM_STATS_DB", tableName "API_LAST_ACCESS_TIME_SUMMARY", ... );
CREATE TEMPORARY TABLE API_LAST_ACCESS_TIME_SUMMARY_FINAL USING CarbonAnalytics OPTIONS (tableName "API_LAST_ACCESS_TIME_SUMMARY", ... );
INSERT INTO TABLE APILastAccessSummaryData select ... from API_LAST_ACCESS_TIME_SUMMARY_FINAL;
As you can see, we're first creating a temporary table in Spark with the name APILastAccessSummaryData, which represents an actual relational DB table with the name API_LAST_ACCESS_TIME_SUMMARY in the WSO2AM_STATS_DB datasource. Note the using CarbonJDBC keyword, which can be used to directly map JDBC tables within Spark. Such tables (and their rows) are not encoded, and can be read by the user.
Second, we're creating another Spark temporary table with the name API_LAST_ACCESS_TIME_SUMMARY_FINAL. Here however, we're using the CarbonAnalytics analytics provider, which will mean that this table will not be a vanilla JDBC table, but an encoded table similar to the one from your previous question.
Now, from the third statement, you can see that we're reading (SELECT) a number of fields from the second table API_LAST_ACCESS_TIME_SUMMARY_FINAL and inserting them (INSERT INTO) into the first, which is APILastAccessSummaryData. This represents the Spark summarisation process.
For more details on the differences between the CarbonAnalytics and CarbonJDBC analytics providers or on how Spark handles such tables in general, have a look at the documentation page for Spark Query Language.