Cloud DataFlow SQL from BigQuery UI cannot read Cloud Storage filesets: "Table not found: datacatalog.entry" - google-cloud-platform

I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI.
My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog).
Following the GCP documentation, I was able to define my fileset, assign it a schema, and visualize it in the Resources tab of the BigQuery UI.
But I cannot launch any Dataflow job from the Query Editor, because the query validator shows the following error: Table not found: datacatalog.entry.location.entry_group.fileset_name...
Could this be an issue of some APIs not being authorized?
Thanks for your help!

You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g. using the sales regions example from the docs:
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
When you are building your Dataflow SQL query:
SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions as tr
INNER JOIN
datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
ON tr.state = sr.state_code
Check the full path; it should look like the example above:
datacatalog.entry, then your project-id, then your location (us-central1 in this example), then your entry group id (dataflow_sql_dataset in this example), and finally your entry id (us_state_salesregions in this example).
Let me know if this works for you.

Related

Error BigQuery/Dataflow "Could not resolve table in Data Catalog"

I'm having trouble with a job I've set up on Dataflow.
Here is the context: I created a dataset on BigQuery using the following path
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US".
Now I want to run a job on Dataflow, so I enter the following command into Cloud Shell:
gcloud dataflow sql query \
'SELECT country, DATE_TRUNC(ORDERDATE, MONTH), sum(sales)
FROM bi-training-gcp.sales.sales_data
GROUP BY 1, 2' \
--job-name=dataflow-sql-sales-monthly \
--region=us-east1 \
--bigquery-dataset=sales \
--bigquery-table=monthly_sales
The query is accepted by the console and returns a sort of acceptance message.
After that I go to the Dataflow dashboard. I can see a new job queued, but after 5 minutes or so the job fails with the following error messages:
Error
2021-09-29T18:06:00.795Z Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error
2021-09-29T18:10:31.592036462Z Error occurred in the launcher container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's location is "US", I thought it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success...
Does anybody know how I can solve this?
Thank you
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API. To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternatively, assign an appropriate Data Catalog role (such as roles/datacatalog.viewer) to the Dataflow service account.
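As a minimal sketch, the API can be enabled from the command line (assuming the gcloud CLI is pointed at the project you run the queries in; bi-training-gcp is the project from the question):
gcloud services enable datacatalog.googleapis.com --project=bi-training-gcp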

BigQuery error in load operation: URI not found

I have, in the same GCP project, a BigQuery dataset and a Cloud Storage bucket, both within the region us-central1. The storage bucket contains a single Parquet file. When I run the command below:
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
gs://my-storage-bucket/my_parquet.parquet
It fails with the following error:
BigQuery error in load operation: Error processing job '[job_no]': Not found: URI gs://my-storage-bucket/my_parquet.parquet
Removing the --project_id or --location flags doesn't affect the outcome.
Figured it out - the documentation is incorrect; I actually had to declare the source as gs://my-storage-bucket/my_parquet.parquet/part* and it loaded fine.
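For reference, a sketch of the corrected load command (assuming, as the wildcard suggests, that my_parquet.parquet is actually a directory of part-files rather than a single file):
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
"gs://my-storage-bucket/my_parquet.parquet/part*"
The quotes keep the shell from trying to expand the * before bq sees it.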
There were some internal issues with BigQuery on 3rd March, and they have been fixed now.
I have confirmed and used the following command to successfully load a Parquet file from Cloud Storage into a BigQuery table using the bq command:
bq load --project_id=PROJECT_ID \
--source_format=PARQUET \
DATASET.TABLE_NAME gs://BUCKET/FILE.parquet
Please note that, according to the official BigQuery documentation, you have to declare the name of the table as DATASET.TABLE_NAME (in the post, I can see : instead of .).

bq extract - BigQuery error in extract operation: An internal error occurred and the request could not be completed

I am trying to export a table from BigQuery to Google Cloud Storage using the following command in the console:
bq --location=<hidden> extract \
--destination_format CSV \
--compression GZIP \
--field_delimiter "|" \
--print_header=true \
<project>:<dataset>.<table> \
gs://<airflow_bucket>/data/zip/20200706_<hidden_name>.gzip
I get the following error:
BigQuery error in extract operation: An internal error occurred and the request could not be completed.
Here is some information about the table:
Table ID: <HIDDEN>
Table size: 6.18 GB
Number of rows: 25,854,282
Created: 18.06.2020, 15:26:10
Table expiration: Never
Last modified: 14.07.2020, 17:35:25
Data location: EU
What I'm trying to do here is extract this table into Google Cloud Storage. Since the table is > 1 GB, it gets fragmented... I want to assemble all those fragments into one archive in a Cloud Storage bucket.
What is happening here? How do I fix this?
Note: I've hidden the actual names and locations of the table and other information with placeholders such as <hidden>, <airflow_bucket>, or <project>:<dataset>.<table>.
I found out the reason behind this: the documentation gives the following syntax for bq extract:
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
I removed --location=<bq_table_location> and it works in principle, except I had to add a wildcard to the destination URI, so I end up with multiple compressed files.
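For context, a sketch of what the working command looks like with a wildcard destination (the placeholders mirror the hidden names above):
bq extract \
--destination_format CSV \
--compression GZIP \
--field_delimiter "|" \
--print_header=true \
<project>:<dataset>.<table> \
gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_*.gzip
BigQuery replaces the * with zero-padded shard numbers (000000000000, 000000000001, ...), which is why several files appear.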
According to the public documentation, you are getting the error due to the 1 GB file size limit.
Currently it is not possible to accomplish what you want without adding an additional step, either by concatenating the files on Cloud Storage or by using a batch job, for example on Dataflow.
There are some Google-provided batch templates that export data from BigQuery to GCS, but none with the CSV format, so you would need to touch some code in order to do it on Dataflow.
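If you still need a single object after a wildcard export, one possible extra step is gsutil compose, which concatenates objects server-side (a sketch; compose accepts at most 32 source objects per call, and the shard names below are assumptions based on the wildcard pattern above):
gsutil compose \
"gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_000000000000.gzip" \
"gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_000000000001.gzip" \
"gs://<airflow_bucket>/data/zip/20200706_<hidden_name>.gzip"
Since concatenated gzip members are still a valid gzip stream, the composed object decompresses as one file.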

Get all my scheduled SQL queries in BigQuery Google Cloud

I'm trying to get the SQL code of my scheduled queries in BigQuery from the command line (CLI). I'm also interested in whether there is a way to do that through the Google Cloud Platform user interface.
I have taken a quick look at this related post, but that's not the answer I am looking for.
List Scheduled Queries in BigQuery
Thank you in advance for all your answers.
I found out how to query the scheduled queries with the bq CLI. You have to rely on the BigQuery Data Transfer API. Why? I don't know, but that's the right keyword here.
To list all your scheduled queries, run this (change the location if you want!):
bq ls --transfer_config --transfer_location=eu
# Result
name:          projects/763366003587/locations/europe/transferConfigs/5de1fc66-0000-20f2-bee7-089e082935bc
displayName:   test
dataSourceId:  scheduled_query
state:
To view the details, copy the name and use bq show:
bq show --transfer_config \
projects/763366003587/locations/europe/transferConfigs/5de1fc66-0000-20f2-bee7-089e082935bc
# Result
updateTime:            2019-11-18T20:20:22.279237Z
destinationDatasetId:  bi_data
displayName:           test
schedule:              every day 20:19
datasetRegion:         europe
userId:                -7444165337568771239
scheduleOptions:       {u'endTime': u'2019-11-18T21:19:36.528Z', u'startTime': u'2019-11-18T20:19:36.497Z'}
dataSourceId:          scheduled_query
params:                {u'query': u'SELECT * FROM `gbl-imt-homerider-basguillaueb.bi_data.device_states`', u'write_disposition': u'WRITE_TRUNCATE', u'destination_table_name_template': u'test_schedule'}
You can use the JSON format and jq to get only the query, like this:
bq show --format="json" --transfer_config \
projects/763366003587/locations/europe/transferConfigs/5de1fc66-0000-20f2-bee7-089e082935bc \
| jq '.params.query'
# Result
"SELECT * FROM `gbl-imt-homerider-basguillaueb.bi_data.device_states`"
I can explain how I found this unexpected solution if you want, but it's not the topic here. I don't think it's documented.
In the GUI, it's easier:
Go to BigQuery (new UI, in blue)
Click on Scheduled queries in the left menu
Click on your scheduled query's name
Click on Configuration at the top of the screen
To get your scheduled queries (= data transfers), you can also use the Python API:
from google.cloud import bigquery_datatransfer

bq_datatransfer_client = bigquery_datatransfer.DataTransferServiceClient()

request_datatransfers = bigquery_datatransfer.ListTransferConfigsRequest(
    # if US, you can just do parent='projects/YOUR_PROJECT_ID'
    parent='projects/YOUR_PROJECT_ID/locations/EU',
)

# this method will also deal with pagination
response_datatransfers = bq_datatransfer_client.list_transfer_configs(
    request=request_datatransfers)

# to convert the response to a list of scheduled queries
datatransfers = list(response_datatransfers)
To get the actual query text from the scheduled query:
for datatransfer in datatransfers:
    print(datatransfer.display_name)
    print(datatransfer.params.get('query'))
    print('\n')
See also these SO questions:
How do I list my scheduled queries via the Python google client API?
List Scheduled Queries in BigQuery
Docs on this specific part of the Python API:
https://cloud.google.com/python/docs/reference/bigquerydatatransfer/latest/google.cloud.bigquery_datatransfer_v1.services.data_transfer_service.DataTransferServiceClient#google_cloud_bigquery_datatransfer_v1_services_data_transfer_service_DataTransferServiceClient_list_transfer_configs

Get the BigQuery Table creator and Google Storage Bucket Creator Details

I am trying to identify the users who created tables in BigQuery.
Is there any command line tool or API that would provide this information? I know that audit logs provide it, but I was looking for a command line approach that I could wrap in a shell script and run against all the tables at once. The same goes for Google Cloud Storage buckets. I did try
gsutil iam get gs://my-bkt and looked for the "role": "roles/storage.admin" binding, but I do not find the admin role on all buckets. Any help?
This is a use case for audit logs. BigQuery tables don't report metadata about the original resource creator, so scanning via tables.list or inspecting the ACLs doesn't really expose who created the resource, only who currently has access.
What's the use case? You could certainly export the audit logs back into BigQuery and query for table creation events going forward, but that's not exactly the same.
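For example, a sketch of routing those events into a BigQuery dataset with a log sink (assuming a dataset named audit_logs already exists and that you grant the sink's writer identity access to it):
gcloud logging sinks create bq-table-creation-sink \
bigquery.googleapis.com/projects/YOUR_PROJECT/datasets/audit_logs \
--log-filter='protoPayload.methodName="google.cloud.bigquery.v2.TableService.InsertTable"'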
You can find it out using audit logs. You can access them either via the Console / Logs Explorer or using the gcloud tool from the CLI.
The log filter that you're interested in is this one:
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
If you want to run it from the command line, you'd do something like this:
gcloud logging read \
'
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'\
--limit 10
You can then post-process the output to find out who created the table. Look for the principalEmail field.
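As a sketch, you can also let gcloud extract that field directly with a format projection (assuming the creator is recorded under protoPayload.authenticationInfo.principalEmail, which is where Admin Activity audit log entries carry the caller's identity):
gcloud logging read \
'
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
' \
--limit 10 \
--format='value(protoPayload.authenticationInfo.principalEmail)'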