I am using BQ Data Transfer to move some zipped JSON data from S3 to BQ.
I am receiving the following error and I'd like to dig deeper into it.
"jsonPayload": {"message": "Job xyz (table ybz) failed with error INVALID_ARGUMENT: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://some-file-name; JobID: PID"},
When I try to open that URL (replacing the gs:// prefix with https://storage.googleapis.com/), I get
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
That object can't be found in any of my GCP Storage buckets.
I suspect there is badly formatted JSON, but without being able to look at the logs and errors properly I can't get back to the S3 bucket owner with relevant information.
You can refer to this document on the BigQuery Data Transfer Service for Amazon S3.
When you load JSON files into BigQuery, note the following:
JSON data must be newline delimited. Each JSON object must be on a separate line in the file (see the short example after this list).
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
BigQuery supports the JSON type even if schema information is not known at the time of ingestion. A field that is declared as JSON type is loaded with the raw JSON values.
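For example, a valid newline-delimited JSON file contains one complete object per line, with no enclosing array and no trailing commas (the field names here are purely illustrative):
{"name": "alpha", "value": 1}
{"name": "beta", "value": 2}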
For more information about the limitations of Amazon S3 transfers, you can refer to this document.
To view the logs of BigQuery Data Transfer runs in the Logs Explorer, you can use this filter:
resource.type="bigquery_dts_config"
labels.run_id="transfer_run_id"
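If you prefer to pull these entries programmatically, a minimal sketch using the google-cloud-logging Python client could look like the following; the filter is the same as above, and transfer_run_id is a placeholder you would replace with your actual run ID:
from google.cloud import logging

client = logging.Client()

# Same filter as shown above; substitute your real transfer run ID.
log_filter = (
    'resource.type="bigquery_dts_config" '
    'labels.run_id="transfer_run_id"'
)

# Print the timestamp and payload of each matching log entry.
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)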
Related
I am referring to this article on creating a Cloud Dataprep pipeline.
When following the step of importing the data while creating the flow, I am not able to read the data and it says access denied, as shown in the screenshot above.
Reference link: https://www.trifacta.com/blog/data-quality-monitoring-for-cloud-dataprep-pipelines/
I tried importing the JSON file and I expect the flow to read the table.
I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and I'm facing the error below. The data files were copied via the hadoop distcp command from an on-prem cluster's Hive instance, version 1.2. Most of the ORC files load successfully, but a few do not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file, and I believe there shouldn't be.
Googling this error message turns up nothing, but I've found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers of creating a symlink to /usr/share/zoneinfo/GMT-00:00, but I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file into JSON format via orc-tools, I'm able to load that JSON file into BigQuery. So I suspect the problem is not in the data itself.
Has anybody come across such a problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and the suggestion was to change them in the data. Our workaround for this was to convert the ORC data to Parquet and then load it into the table.
Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you could get the “GMT-00:00” string, together with the preceding space, replaced with just “-00:00”, that would be a correct time zone representation. Can you change the configuration of the system which generated the file so that it writes a proper time zone string?
Creating a symlink only masks the problem and does not solve it properly; in the case of BigQuery it is not possible anyway.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are Python libraries, such as Iris, that will decode the data from the binary files. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), by SNS, SQS, or by a schedule in EventBridge. The function uses the code snippet included in your question to transform each new file and upload the transformed data to another S3 bucket (a rough sketch of such a function follows after these notes).
I don't know the size of these files or how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small, CSV will probably be easiest and good enough.
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
If you end up with a lot of small files you might want to implement a second step where you once per day combine the converted files into bigger files, and possibly Parquet, but don't do anything like that unless you have to.
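A rough sketch of the Lambda function described above, assuming the trigger is a plain S3 notification and using iris plus pandas for the flattening; the bucket name, key layout, and the cube-to-table conversion are illustrative and will depend on the structure of your NetCDF files:
import os

import boto3
import iris
import iris.pandas

s3 = boto3.client("s3")
TARGET_BUCKET = "my-converted-forecasts"  # placeholder output bucket

def handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Download the new NetCDF file into Lambda's temporary storage.
        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(src_bucket, key, local_path)

        # Decode the forecast data and flatten each cube into a table.
        cubes = iris.load(local_path)
        for i, cube in enumerate(cubes):
            df = iris.pandas.as_data_frame(cube)
            out_path = f"/tmp/{os.path.basename(key)}.{i}.csv"
            df.to_csv(out_path)

            # Upload the converted data so Athena can query it.
            s3.upload_file(out_path, TARGET_BUCKET, f"converted/{os.path.basename(out_path)}")

If the notification arrives via SQS instead of a direct S3 event, each SQS record carries the S3 event JSON in its body, so it needs to be unwrapped before the loop above.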
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases Athena Federated Query was designed for, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful to other people, I'm sure the Athena team would love to talk to you and help you out.
I have seen a lot of examples of Apache Beam where you read data from Pub/Sub and write to a GCS bucket; however, is there any example of using KafkaIO and writing to a GCS bucket?
One where I can parse the message and put it in the appropriate bucket based on the message content?
For example:
message = {type="type_x", some other attributes....}
message = {type="type_y", some other attributes....}
type_x --> goes to bucket x
type_y --> goes to bucket y
My use case is streaming data from Kafka to a GCS bucket, so if someone suggests a better way to do it in GCP, that's welcome too.
Thanks.
Regards,
Anant.
You can use Secor to load messages into a GCS bucket. Secor is also able to parse incoming messages and put them under different paths in the same bucket.
You can take a look at the example here: https://github.com/0x0ece/beam-starter/blob/master/src/main/java/com/dataradiant/beam/examples/StreamWordCount.java
Once you have read the data elements, if you want to write to multiple destinations based on a specific data value, you can look at additional (tagged) outputs using TupleTagList; the details can be found here: https://beam.apache.org/documentation/programming-guide/#additional-outputs
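If you end up writing the pipeline in Python rather than Java, a hedged sketch of the same idea (read from Kafka, tag each element by its type field, write each tag to its own bucket) could look like this; the broker address, topic, window size, and bucket names are placeholders, and the Kafka values are assumed to be UTF-8 JSON:
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.transforms.window import FixedWindows

class RouteByType(beam.DoFn):
    def process(self, element):
        # ReadFromKafka yields (key, value) pairs of bytes.
        _, value = element
        message = json.loads(value.decode("utf-8"))
        tag = "type_x" if message.get("type") == "type_x" else "type_y"
        yield beam.pvalue.TaggedOutput(tag, json.dumps(message))

with beam.Pipeline() as pipeline:
    routed = (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["my-topic"])
        | "Window" >> beam.WindowInto(FixedWindows(60))
        | "RouteByType" >> beam.ParDo(RouteByType()).with_outputs("type_x", "type_y")
    )
    routed.type_x | "WriteTypeX" >> beam.io.WriteToText(
        "gs://bucket-x/messages", file_name_suffix=".json", num_shards=1)
    routed.type_y | "WriteTypeY" >> beam.io.WriteToText(
        "gs://bucket-y/messages", file_name_suffix=".json", num_shards=1)

Note that ReadFromKafka is a cross-language transform, so a Java environment has to be available for the Kafka expansion service, and with an unbounded source you need the windowing and explicit shard count shown above (or a fileio-based sink) before the text writes.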
I am testing out the transfer service in GCP:
This is the open data in CSV: https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2018-financial-year-provisional/Download-data/annual-enterprise-survey-2018-financial-year-provisional-csv.csv
My configuration in GCP:
The transfer failed as below:
Question 1: why the transfer failed?
Question 2: where is the error log?
Thank you very much.
[UPDATE]:
I checked log history, nothing was captured:
[Update 2]:
Error details:
Details: First line in URL list must be TsvHttpData-1.0 but it is: Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Variable_name,Variable_category,Value,Industry_code_ANZSIC06
I noticed that in the Transfer Service, if you choose the third option for the source, it reads a URL of a TSV file. Essentially TSV and PSV are just variants of CSV, and I have no problem retrieving the source CSV file. The error details seem to be implying that something unexpected is there.
The problem is that in your example you are pointing to a data file as the source of the transfer. If we read the documentation on GCS transfers, we find that we must specify a file which lists the URLs of the files that we want to copy.
The format of this file is Tab-Separated Values (TSV), and each data row contains a number of fields, including the following (an illustrative example follows below):
The URL of the source of the file.
The size in bytes of the source file.
An MD5 hash of the content of the source file.
What you specified (just the URL of the source file) ... is not what is required.
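For illustration, a minimal URL list for your CSV would look roughly like the following, with the fields separated by tabs; the size and MD5 values are placeholders, not the real ones for that file:
TsvHttpData-1.0
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2018-financial-year-provisional/Download-data/annual-enterprise-survey-2018-financial-year-provisional-csv.csv	<size-in-bytes>	<base64-md5>
The transfer service then fetches the CSV itself from that URL and copies it into your bucket.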
One possible solution would be to use gsutil. It has an option of taking a stream as input and writing that stream to a given object. For example:
curl http://[URL]/[PATH] | gsutil cp - gs://[BUCKET]/[OBJECT]
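With the CSV from your question, that would be something like the following; the destination bucket and object name are yours to choose:
curl https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2018-financial-year-provisional/Download-data/annual-enterprise-survey-2018-financial-year-provisional-csv.csv | gsutil cp - gs://[YOUR_BUCKET]/annual-enterprise-survey-2018.csv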
References:
Creating a URL list
Can I upload files to google cloud storage from url?