Merge BigQuery table with Google Cloud Postgres table with federated query - google-cloud-platform

I am trying to merge a BigQuery table (target) with a Google Cloud PostgreSQL table (source) using a federated query. However, it looks like BigQuery will not accept a federated query in the USING clause.
Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "EXTERNAL_QUERY" at [3:9]
My query looks something like the below.
MERGE bigquery_dataset.bigquery_table TARGET
USING (
EXTERNAL_QUERY("projects/company-co/locations/us/connections/company","SELECT * FROM postgres_schema.postgres_table")
) SOURCE
ON target.id = source.id
WHEN MATCHED THEN ...
WHEN NOT MATCHED BY TARGET THEN ...
WHEN NOT MATCHED BY SOURCE THEN ...
Are there any known workarounds for this type of functionality? Or is there any other way to perform this type of merge?

To run federated queries in BigQuery where the external data source is a Cloud SQL for PostgreSQL instance, you need to define the source using the SQL function EXTERNAL_QUERY.
The error you are getting, "Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "EXTERNAL_QUERY" at [3:9]", occurs because the SELECT statement before EXTERNAL_QUERY is missing: the USING clause expects a table or a subquery that starts with SELECT or WITH, and EXTERNAL_QUERY is a table function that has to appear in a FROM clause.
As per this doc, the syntax should be:
SELECT * FROM EXTERNAL_QUERY(connection_id, external_database_query[, options]);
I tried running the federated query in BigQuery with the source in Cloud SQL for PostgreSQL, and it works as expected.
SQL query:
MERGE myproject.demo.tab1 TARGET
USING (
select * from EXTERNAL_QUERY("projects/myproject/locations/us-central1/connections/sqltobig", "SELECT * FROM entries;")
) SOURCE
ON target.entryID = source.entryID
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT(guestName, content, entryID)
VALUES(guestName, content, entryID)
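Applying the same fix to the query from the question would look something like the sketch below; the id and name columns and the UPDATE/INSERT/DELETE actions are placeholders, since the original query elides them.
MERGE bigquery_dataset.bigquery_table TARGET
USING (
SELECT * FROM EXTERNAL_QUERY("projects/company-co/locations/us/connections/company", "SELECT * FROM postgres_schema.postgres_table")
) SOURCE
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET name = source.name -- placeholder action
WHEN NOT MATCHED BY TARGET THEN
INSERT (id, name) VALUES (id, name) -- placeholder columns
WHEN NOT MATCHED BY SOURCE THEN
DELETE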

Related

Error Bigquery/dataflow "Could not resolve table in Data Catalog"

I'm having trouble with a job I've set up on Dataflow.
Here is the context: I created a dataset on BigQuery using the following path
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US"
Now I want to run a job on Dataflow, so I enter the following command into the Google Cloud Shell:
gcloud dataflow sql query \
'SELECT country, DATE_TRUNC(ORDERDATE, MONTH), sum(sales) FROM bi-training-gcp.sales.sales_data group by 1,2' \
--job-name=dataflow-sql-sales-monthly \
--region=us-east1 \
--bigquery-dataset=sales \
--bigquery-table=monthly_sales
The query is accepted by the console and returns a confirmation message.
After that I go to the Dataflow dashboard. I can see the new job queued, but after 5 minutes or so it fails and I get the following error messages:
Error
2021-09-29T18:06:00.795Z Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error
2021-09-29T18:10:31.592036462Z Error occurred in the launcher container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's data location is "US", I thought it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success...
Does anybody know how I can solve this?
Thank you
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API. To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternatively, assign the roles/datacatalog.viewer role to the Dataflow service account.
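As a sketch, enabling the API and granting the role could look like the commands below; the project ID is taken from the question and the service account email is a placeholder.
gcloud services enable datacatalog.googleapis.com --project=bi-training-gcp
gcloud projects add-iam-policy-binding bi-training-gcp \
--member="serviceAccount:<dataflow-service-account>" \
--role="roles/datacatalog.viewer"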

Cloud DataFlow SQL from BigQuery UI cannot read Cloud Storage filesets: "Table not found: datacatalog.entry"

I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI.
My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog).
Following the GCP documentation, I was able to define my fileset, assign it a schema and visualize it in the Resources tab of the BigQuery UI.
But then I cannot launch any Dataflow job in the Query Editor, because I get the following error message in the query validator: Table not found: datacatalog.entry.location.entry_group.fileset_name...
Is it an issue of some APIs not authorized?
Thanks for your help!
You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g. using the sales regions example from the docs:
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
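To double-check that the entry exists in the expected location and entry group, describing it should work; this assumes a reasonably recent gcloud and reuses the example values from above.
gcloud data-catalog entries describe us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset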
When you are building your DataFlow SQL query:
SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions as tr
INNER JOIN
datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
ON tr.state = sr.state_code
Check the full path; it should look like the example above: datacatalog.entry, then your location (in this example us-central1), then your project-id, then your entry group ID (in this example dataflow_sql_dataset), then your entry ID (in this example us_state_salesregions).
Let me know if this works for you.

Use SQL Workbench to import a CSV file into an AWS Redshift database

I'm looking for both a manual and an automatic way to use SQL Workbench to import/load a LOCAL CSV file into an AWS Redshift database.
The manual way could be clicking through the navigation bar and selecting an option.
The automatic way could be some query code that loads the data when run.
Here's my attempt:
There's an error, "my target table in AWS is not found", but I'm sure the table exists. Does anyone know why?
WbImport -type=text
-file ='C:\myfile.csv'
-delimiter = ,
-table = public.data_table_in_AWS
-quoteChar=^
-continueOnError=true
-multiLine=true
You can use WbImport in SQL Workbench/J to import the data.
For more info: http://www.sql-workbench.net/manual/command-import.html
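A cleaned-up sketch of the WbImport call, using the file path and table name from the question and written without spaces around the equals signs:
WbImport -type=text
-file='C:\myfile.csv'
-delimiter=','
-table=public.data_table_in_AWS
-quoteChar='^'
-continueOnError=true
-multiLine=true;
The target table also has to exist in the database and schema that the current connection points at.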
As mentioned in the comments, the COPY command provided by Redshift is the optimal solution. You can COPY from S3, EC2, etc.
S3 Example:
copy <your_table>
from 's3://<bucket>/<file>'
access_key_id 'XXXX'
secret_access_key 'XXXX'
region '<your_region>'
delimiter '\t';
For more examples:
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
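COPY reads from S3 rather than from the local machine, so the local CSV would first need to be uploaded, for example with the AWS CLI (the bucket name is a placeholder):
aws s3 cp C:\myfile.csv s3://your-bucket/myfile.csv
The COPY statement above can then point at 's3://your-bucket/myfile.csv'.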

AWS SimpleDB CLI: How to use the 'select' command?

I'm trying to use the select command of AWS SimpleDB from AWS CLI.
The required call is as follows:
select
--select-expression <value>
with select-expression being described as follows: --select-expression (string) The expression used to query the domain.
The select expression is supposed to be similar to a SQL SELECT statement; however, I keep getting syntax errors, e.g.:
aws sdb select --select-expression "select * from my-domain"
An error occurred (InvalidQueryExpression) when calling the Select operation: The specified query expression syntax is not valid.
I can't find any documentation or example about the right syntax to use, either.
I found the solution: it turns out I needed to use single quotes around the query and backticks around the domain name:
aws sdb select --select-expression 'select * from `my-domain`'
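The same quoting pattern should extend to filtering; the attribute name, value, and limit in this sketch are placeholders:
aws sdb select --select-expression 'select * from `my-domain` where `name` = "Alice" limit 10'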

How to export SQL Output directly to CSV on Amazon RDS

Amazon doesn't give access to the RDS server directly (it is exposed only through the RDS service), hence SELECT ... INTO OUTFILE doesn't work.
Even the master user does not have the FILE privilege.
I created a ticket with Amazon and talked at length with them. They suggested a few workarounds, like using Data Pipeline, but all are too complicated.
One way could be to use a tool like MySQL Workbench: execute the query, then export to CSV. The only problem with this approach is that you need to execute the same query twice on the server, which is problematic if your output has thousands of rows.
Just write the query in a file, a.sql. The SQL should be in this format:
select concat( '"',Product_id,'","', Subcategory,'","', ifnull(Product_type,''),'","', ifnull(End_Date,''), '"') as data from tablename
mysql -h xyz.abc7zdltfa3r.ap-southeast-1.rds.amazonaws.com -u query -pxyz < a.sql > deepak.csv
The output will be in the file deepak.csv.
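An alternative to hand-building the CSV with concat() is to let the mysql client emit tab-separated batch output and convert the tabs to commas; this sketch assumes a plain SELECT in a hypothetical b.sql and GNU sed, and it does no quoting, so it is only safe if the data contains no commas, tabs, or newlines:
mysql -h xyz.abc7zdltfa3r.ap-southeast-1.rds.amazonaws.com -u query -pxyz --batch < b.sql | sed 's/\t/,/g' > deepak.csv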