BigQuery error in load operation: URI not found - google-cloud-platform

I have, in the same GCP project, a BigQuery dataset and a Cloud Storage bucket, both in the us-central1 region. The storage bucket contains a single Parquet file. When I run the command below:
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
gs://my-storage-bucket/my_parquet.parquet
It fails with the following error:
BigQuery error in load operation: Error processing job '[job_no]': Not found: URI gs://my-storage-bucket/my_parquet.parquet
Removing the --project_id or --location flags doesn't affect the outcome.

Figured it out: the documentation is incorrect. I actually had to declare the source as gs://my-storage-bucket/my_parquet.parquet/part* and it loaded fine.
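For reference, a sketch of the full load command with that wildcard source (same flags as the original command above; the URI is quoted so the shell doesn't try to expand the *):
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
"gs://my-storage-bucket/my_parquet.parquet/part*"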

There were some internal issues with BigQuery on 3rd March, and they have been fixed now.
I have confirmed and used the following command to successfully load a Parquet file from Cloud Storage into a BigQuery table using the bq command:
bq load --project_id=PROJECT_ID \
--source_format=PARQUET \
DATASET.TABLE_NAME gs://BUCKET/FILE.parquet
Please note that, according to the official BigQuery documentation, you have to declare the table name as DATASET.TABLE_NAME (in the post I can see a : instead of a .).

Related

BigQuery transfer service: which file causes error?

I'm trying to load around 1000 files from Google Cloud Storage into BigQuery using the BigQuery transfer service, but it appears I have an error in one of my files:
Job bqts_601e696e-0000-2ef0-812d-f403043921ec (table streams) failed with error INVALID_ARGUMENT: Error while reading data, error message: CSV table references column position 19, but line starting at position:206 contains only 19 columns.; JobID: 931777629779:bqts_601e696e-0000-2ef0-812d-f403043921ec
How can I find which file is causing this error?
I feel like this is in the docs somewhere, but I can't seem to find it.
Thanks!
You can use bq show --format=prettyjson -j job_id_here, which will show a verbose error description for the failed job. You can see more info about the usage of the command in the BigQuery managing jobs docs.
I tried this with a failed job of mine wherein I'm loading CSV files from a Google Cloud Storage bucket in my project.
Command used:
bq show --format=prettyjson -j bqts_xxxx-xxxx-xxxx-xxxx
Here is a snippet of the output. Output is in JSON format:
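The failing file usually shows up in the location field under status.errorResult / status.errors. An illustrative sketch of the relevant part of that output (not the actual output; the bucket and file names below are placeholders):
{
  "status": {
    "errorResult": {
      "location": "gs://my-bucket/path/some_file.csv",
      "message": "Error while reading data, error message: CSV table references column position 19, but line starting at position:206 contains only 19 columns.",
      "reason": "invalid"
    },
    "state": "DONE"
  }
}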

bq extract - BigQuery error in extract operation: An internal error occurred and the request could not be completed

I am trying to export a table from BigQuery to Google Cloud Storage using the following command in the console:
bq --location=<hidden> extract --destination_format CSV --compression GZIP --field_delimiter "|" --print_header=true <project>:<dataset>.<table> gs://<airflow_bucket>/data/zip/20200706_<hidden_name>.gzip
I get the following error :
BigQuery error in extract operation: An internal error occurred and the request could not be completed.
Here is some information about the table:
Table ID: <HIDDEN>
Table size: 6.18 GB
Number of rows: 25,854,282
Created: 18.06.2020, 15:26:10
Table expiration: Never
Last modified: 14.07.2020, 17:35:25
Data location: EU
What I'm trying to do here is extract this table into Google Cloud Storage. Since the table is larger than 1 GB, it gets split into multiple fragments, and I want to assemble all those fragments into one archive in a Google Cloud Storage bucket.
What is happening here? How do I fix this?
Note: I've hidden the actual names and locations of the table and other information with placeholders such as <hidden>, <airflow_bucket>, or <project>:<dataset>.<table>.
I found out the reason behind this. The documentation gives the following syntax for bq extract:
> bq --location=location extract \
> --destination_format format \
> --compression compression_type \
> --field_delimiter delimiter \
> --print_header=boolean \
> project_id:dataset.table \
> gs://bucket/filename.ext
I removed --location=<bq_table_location> and it works in principle, except that I had to add a wildcard to the destination URI, so I end up with multiple compressed files.
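For reference, the shape of the command after those changes (the wildcard file pattern below is just an example, not the exact name):
bq extract \
--destination_format CSV \
--compression GZIP \
--field_delimiter "|" \
--print_header=true \
<project>:<dataset>.<table> \
"gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_*.gzip"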
According to the public documentation, you are getting the error due to the 1 GB per-file export limit.
Currently it is not possible to accomplish what you want without an additional step: either concatenate the shards on Cloud Storage, or use a batch job, for example on Dataflow.
There are some Google-provided batch templates that export data from BigQuery to GCS, but none with the CSV format, so you would need to touch some code in order to do it on Dataflow.
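If all you need is a single object on Cloud Storage, one option is to compose the extracted shards with gsutil. This is a sketch assuming the shard pattern shown above and at most 32 shards (the per-request limit of compose); it relies on concatenated gzip members still forming a valid gzip stream:
gsutil compose \
"gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_*.gzip" \
"gs://<airflow_bucket>/data/zip/20200706_<hidden_name>.gzip"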

Cloud Dataflow SQL from BigQuery UI cannot read Cloud Storage filesets: "Table not found: datacatalog.entry"

I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI.
My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog).
Following the GCP documentation, I was able to define my fileset, assign it a schema and visualize it in the Resources tab of the BigQuery UI.
But then I cannot launch any Dataflow job in the Query Editor, because I get the following error message in the query validator: Table not found: datacatalog.entry.location.entry_group.fileset_name...
Could it be an issue of some APIs not being authorized?
Thanks for your help!
You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g. using the sales regions example from the docs:
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
When you are building your Dataflow SQL query:
SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions as tr
INNER JOIN
datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
ON tr.state = sr.state_code
Check the full path; it should look like the example above: datacatalog.entry, then your project-id, then your location (us-central1 in this example), then your entry group ID (dataflow_sql_dataset in this example), and finally your entry ID (us_state_salesregions in this example).
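If in doubt, you can also double-check which location and entry group your fileset entry was created in, for example with gcloud (a sketch; the describe flags are assumed to mirror the create command above):
gcloud data-catalog entries describe us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset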
Let me know if this works for you.

How to schedule BigQuery DataTransfer Service using bq command

I am trying to create a Data Transfer Service using BigQuery, and I used the bq command to create the DTS.
I am able to create the DTS successfully.
I need to specify a custom time for scheduling using the bq command.
Is it possible to set a custom schedule while creating the Data Transfer Service? Refer to the sample bq command below:
bq mk --transfer_config \
--project_id='My project' \
--target_dataset='My Dataset' \
--display_name='test_bqdts' \
--params='{"data_path":<data_path>,
"destination_table_name_template":<destination_table_name>,
"file_format":<>,
"ignore_unknown_values":"true",
"access_key_id": "access_key_id",
"secret_access_key": "secret_access_key"
}' \
--data_source=data_source_id
NOTE: When you create an Amazon S3 transfer using the command-line tool, the transfer configuration is set up using the default value for Schedule (every 24 hours).
You can use the --schedule flag, as you can see here:
Option 2: Use the bq mk command.
Scheduled queries are a kind of transfer. To schedule a query, you can use the BigQuery Data Transfer Service CLI to make a transfer configuration.
Queries must be in StandardSQL dialect to be scheduled.
Enter the bq mk command and supply the transfer creation flag --transfer_config. The following flags are also required:
--data_source
--target_dataset (Optional for DDL/DML queries.)
--display_name
--params
Optional flags:
--project_id is your project ID. If --project_id isn't specified, the default project is used.
--schedule is how often you want the query to run. If --schedule isn't specified, the default is 'every 24 hours' based on creation time.
For DDL/DML queries, you can also supply the --location flag to specify a particular region for processing. If --location isn't specified, the global Google Cloud location is used.
--service_account_name is for authenticating your scheduled query with a service account instead of your individual user account. Note: Using service accounts with scheduled queries is in beta.
bq mk \
--transfer_config \
--project_id=project_id \
--target_dataset=dataset \
--display_name=name \
--params='parameters' \
--data_source=data_source
If you want to set a 24-hour schedule, for example, you should use --schedule='every 24 hours'.
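Applied to the S3 transfer from the question, the creation command would look roughly like this (a sketch; --params is the same JSON as in the question, omitted here for brevity):
bq mk --transfer_config \
--project_id='My project' \
--target_dataset='My Dataset' \
--display_name='test_bqdts' \
--schedule='every 24 hours' \
--params='{...}' \
--data_source=data_source_id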
You can find the complete reference for the time syntax here
I hope it helps

Can I use an Athena View as a source for an AWS Glue Job?

I'm trying to use an Athena view as a data source for my AWS Glue job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as?
Thank you
You can, by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue as of 25-Jan-2019) can directly access views.
1. Download the driver and store the jar in an S3 bucket.
2. Specify the S3 path to the driver as a dependent jar in your job definition.
3. Load the data into a dynamic frame using the code below (using an IAM user with permission to run Athena queries).
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Standard Glue job setup.
glueContext = GlueContext(SparkContext.getOrCreate())

# Read the Athena view over JDBC using the Simba Athena driver.
athena_view_dataframe = (
    glueContext.read.format("jdbc")
    .option("user", "[IAM user access key]")
    .option("password", "[IAM user secret access key]")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", "my_database.my_athena_view")
    .option("S3OutputLocation", "s3://bucket/temp/folder")  # CSVs/metadata dumped here on load
    .load()
)

# Wrap the Spark DataFrame as a Glue DynamicFrame for use in the rest of the job.
athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")
The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).
The main side effect of this approach is that loading causes the query results to be dumped in CSV format to the bucket specified with the S3OutputLocation key.
I don't believe that you can create a Glue Connection to Athena via JDBC because you can't specify an S3 path to the driver location.
Attribution: AWS support totally helped me get this working.