bq extract - BigQuery error in extract operation: An internal error occurred and the request could not be completed - google-cloud-platform

I am trying to export a table from BigQuery to google storage using the following command within the console:
bq --location=<hidden> extract --destination_format CSV --compression GZIP --field_delimiter "|" --print_header=true <project>:<dataset>.<table> gs://<airflow_bucket>/data/zip/20200706_<hidden_name>.gzip
I get the following error :
BigQuery error in extract operation: An internal error occurred and the request could not be completed.
Here is some information about the said table
Table ID <HIDDEN>
Table size 6.18 GB
Number of rows 25,854,282
Created 18.06.2020, 15:26:10
Table expiration Never
Last modified 14.07.2020, 17:35:25
Data location EU
What I'm trying to do here is extract this table into Google Storage. Since the table is larger than 1 GB, the export gets split into multiple fragments... I want to assemble all those fragments into one archive in a Google Cloud Storage bucket.
What is happening here? How do I fix this?
Note: I've hidden the actual names and locations of the table and other information with placeholders such as <hidden> or <airflow_bucket>.

I found out the reason behind this. The documentation gives the following syntax for bq extract:
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
I removed --location=<bq_table_location> and it works in principle, except that I had to add a wildcard to the destination URI, so I end up with multiple compressed files.
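For reference, a minimal sketch of the adjusted command (the project, dataset, table, and bucket names are the placeholders from the question; the * wildcard lets BigQuery shard the export across several files):
bq extract \
--destination_format CSV \
--compression GZIP \
--field_delimiter "|" \
--print_header=true \
<project>:<dataset>.<table> \
gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_*.gzip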

According to the public documentation, you are getting the error because of the 1 GB limit on exporting a table to a single file.
Currently it is not possible to accomplish what you want without adding an extra step, either by concatenating the files on Cloud Storage or by running a batch job, for example on Dataflow.
There are some Google-provided batch templates that export data from BigQuery to GCS, but none that produce CSV, so you would need to modify some code to do it on Dataflow.
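If you go the concatenation route, one possible sketch uses gsutil compose (assuming the exported shards share the prefix from the question and that there are at most 32 of them, the per-call component limit):
# Merge the exported shards into a single object in the bucket.
gsutil compose \
gs://<airflow_bucket>/data/zip/20200706_<hidden_name>_*.gzip \
gs://<airflow_bucket>/data/zip/20200706_<hidden_name>.gzip
Concatenated gzip members still form a valid gzip stream, so the composed object should remain decompressible, but keep in mind that with --print_header=true every shard carries its own header row, which will therefore be repeated inside the merged file.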

Related

Where to find detailed logs of BigQuery Data Transfer

I am using BQ Data Transfer to move some zipped JSON data from s3 to BQ.
I am receiving the following error and I'd like to dig deeper into it.
"jsonPayload": {"message": "Job xyz (table ybz) failed with error INVALID_ARGUMENT: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://some-file-name; JobID: PID"},
When I try to open that URL (replacing the gs:// part with https://storage.googleapis.com/), I get:
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
That bucket can't be found among my GCP Storage buckets.
I suspect there is badly formatted JSON, but without being able to look at the logs and errors in detail, I can't go back to the S3 bucket owner with relevant information.
You can refer to this document on the BigQuery Data Transfer Service for Amazon S3.
When you load JSON files into BigQuery, note the following:
JSON data must be newline delimited: each JSON object must be on a separate line in the file (see the small example after this list).
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
BigQuery supports the JSON type even if schema information is not known at the time of ingestion. A field that is declared as JSON type is loaded with the raw JSON values.
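As a minimal illustration of the newline-delimited requirement (the file name is hypothetical), each JSON object sits on its own line:
# Create a small newline-delimited JSON file; every record is a complete object on one line.
cat > sample.json <<'EOF'
{"id": 1, "name": "alpha"}
{"id": 2, "name": "beta"}
EOF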
For more information regarding the limitations of Amazon S3 transfers, you can refer to this document.
To view logs of BigQuery Data Transfer runs in the Logs Explorer, you can use this filter:
resource.type="bigquery_dts_config"
labels.run_id="transfer_run_id"
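If you prefer the command line, roughly the same filter can be passed to gcloud logging read (a sketch; transfer_run_id is a placeholder for your actual run ID):
# Fetch the transfer run's log entries as JSON for closer inspection.
gcloud logging read 'resource.type="bigquery_dts_config" AND labels.run_id="transfer_run_id"' \
--limit=50 \
--format=json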

BigQuery error in load operation: URI not found

I have, in the same GCP project, a BigQuery dataset and a cloud storage bucket, both within the region us-central1. The storage bucket has a single parquet file located in it. When I run the below command:
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
gs://my-storage-bucket/my_parquet.parquet
It fails with the below error:
BigQuery error in load operation: Error processing job '[job_no]': Not found: URI gs://my-storage-bucket/my_parquet.parquet
Removing the --project_id or --location flags doesn't affect the outcome.
Figured it out: the documentation is incorrect. I actually had to declare the source as gs://my-storage-bucket/my_parquet.parquet/part* and it loaded fine.
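For reference, a sketch of the adjusted command with the same flags as above (bucket, dataset, and table names are the question's placeholders):
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
gs://my-storage-bucket/my_parquet.parquet/part*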
There were some internal issues with BigQuery on 3rd March, and they have been fixed now.
I have confirmed and used the following command to successfully load a Parquet file from Cloud Storage into a BigQuery table with the bq command:
bq load --project_id=PROJECT_ID \
--source_format=PARQUET \
DATASET.TABLE_NAME gs://BUCKET/FILE.parquet
Please note that, according to the official BigQuery documentation, you have to declare the table name as DATASET.TABLE_NAME (in the post, I can see a : instead of a .).

BigQuery transfer service: which file causes error?

I'm trying to load around 1000 files from Google Cloud Storage into BigQuery using the BigQuery transfer service, but it appears I have an error in one of my files:
Job bqts_601e696e-0000-2ef0-812d-f403043921ec (table streams) failed with error INVALID_ARGUMENT: Error while reading data, error message: CSV table references column position 19, but line starting at position:206 contains only 19 columns.; JobID: 931777629779:bqts_601e696e-0000-2ef0-812d-f403043921ec
How can I find which file is causing this error?
I feel like this is in the docs somewhere, but I can't seem to find it.
Thanks!
You can use bq show --format=prettyjson -j job_id_here, which will show a verbose error for the failed job. You can see more info about the usage of the command in the BigQuery managing jobs docs.
I tried this with a failed job of mine where I was loading CSV files from a Google Cloud Storage bucket in my project.
Command used:
bq show --format=prettyjson -j bqts_xxxx-xxxx-xxxx-xxxx
Here is a snippet of the output, in JSON format:

Google BigQuery cannot read some ORC data

I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and I'm facing the error below. The data files were copied via the hadoop distcp command from an on-prem cluster's Hive instance, version 1.2. Most of the ORC files load successfully, but a few do not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file, and I believe there shouldn't be.
Googling turns up nothing about this error message, but I've found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers: creating a symlink to /usr/share/zoneinfo/GMT-00:00. But I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file via orc-tools into JSON format, I'm able to load that JSON file into BigQuery. So I suspect the problem is not in the data itself.
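For completeness, a sketch of that detour (the orc-tools jar version and the target table name are hypothetical, and depending on the orc-tools version you may need to strip any non-JSON status lines from the dump):
# Dump the ORC rows as newline-delimited JSON with the orc-tools uber jar.
java -jar orc-tools-1.6.3-uber.jar data hive/part-v006-o000-r-00000_a_17 > rows.json
# Load the JSON into a scratch table, letting BigQuery detect the schema.
bq load --source_format NEWLINE_DELIMITED_JSON --autodetect hadoop_migration.pm_json rows.json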
Has anybody come across this problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and the suggestion was to change them in the data. Our workaround was to convert the ORC data to Parquet and then load that into the table.
Indeed, this error can happen. It also appears when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you could get the “GMT-00:00” string (with its preceding space) replaced with just “-00:00”, that would be a correct time zone representation. Can you change the configuration of the system which generated the file so that it writes a proper time zone string?
Creating a symlink only masks the problem and does not solve it properly, and in the case of BigQuery it is not possible anyway.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
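Building on the literal examples in the quoted reply, you can reproduce the distinction from the command line (a sketch; the second query is expected to fail with the same parsing error):
# Accepted: the offset is written directly as -00:00.
bq query --nouse_legacy_sql "SELECT TIMESTAMP('2020-01-01 00:00:00-00:00')"
# Rejected: BigQuery does not understand the 'GMT-00:00' spelling of the zone.
bq query --nouse_legacy_sql "SELECT TIMESTAMP('2020-01-01 00:00:00 GMT-00:00')"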

Cloud DataFlow SQL from BigQuery UI cannot read Cloud Storage filesets: "Table not found: datacatalog.entry"

I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI.
My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog).
Following the GCP documentation, I was able to define my fileset, assign it a schema, and visualize it in the Resources tab of the BigQuery UI.
But then I cannot launch any Dataflow job in the Query Editor, because I get the following error message in the query validator: Table not found: datacatalog.entry.location.entry_group.fileset_name...
Is it an issue of some APIs not authorized?
Thanks for your help!
You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g. using the sales regions example from the docs:
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
When you are building your DataFlow SQL query:
SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions as tr
INNER JOIN
datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
ON tr.state = sr.state_code
Check the full path; it should look like the example above: datacatalog.entry, then your location (us-central1 in this example), then your project-id, then your entry group ID (dataflow_sql_dataset in this example), and finally your entry ID (us_state_salesregions).
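To double-check those pieces, you can describe the entry you created with the Data Catalog CLI (a sketch, reusing the example names from above):
# Print the fileset entry so you can compare its location, entry group, and ID with the query path.
gcloud data-catalog entries describe us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset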
Let me know if this works for you.