I want to solve a parsing error that occurs when uploading CSV files to AWS Neptune.
The problem seems to be caused by the column names and their types, but I do not know which types are correct to write in the header.
I converted all of the data to strings before uploading the CSVs.
The problem does not occur with: "~id","pv_time:String","order_num:String","staff_num:String","~label"
The problem occurs with: "order_num","order_from:String","order_to:String","station_name:String","~label"
The ~id and ~label headers are required.
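For example, assuming order_num is meant to uniquely identify each row (an assumption, not something stated above), the failing header could be rewritten so that the id column uses the reserved name:
"~id","order_from:String","order_to:String","station_name:String","~label"
Alternatively, keep "order_num:String" as a regular property column and add a separate "~id" column with unique values.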
I am using BQ Data Transfer to move some zipped JSON data from S3 to BQ.
I am receiving the following error and I'd like to dig deeper into it.
"jsonPayload": {"message": "Job xyz (table ybz) failed with error INVALID_ARGUMENT: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://some-file-name; JobID: PID"},
When I try to open that URL (replacing the gs:// part with https://storage.googleapis.com/), I get
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
That object can't be found among my GCP Storage buckets.
I suspect there is badly formatted JSON, but without a clear look at the logs and errors I can't go back to the S3 bucket owner with relevant information.
You can refer to this document on the BigQuery Data Transfer Service for Amazon S3.
When you load JSON files into BigQuery, note the following:
JSON data must be newline delimited. Each JSON object must be on a separate line in the file (see the conversion sketch after this list).
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
BigQuery supports the JSON type even if schema information is not known at the time of ingestion. A field that is declared as JSON type is loaded with the raw JSON values.
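If your source files contain a single JSON array rather than newline-delimited records, converting them is straightforward. A minimal Python sketch, using hypothetical file names:
import json
# Assumes input.json holds one top-level JSON array; writes one object per line (NDJSON).
with open("input.json") as src, open("output.ndjson", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")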
For more information about the limitations of Amazon S3 transfers, you can refer to this document.
To view the logs of BigQuery Data Transfer runs in the Logs Explorer, you can use this filter:
resource.type="bigquery_dts_config"
labels.run_id="transfer_run_id"
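If you prefer to pull the same entries programmatically, here is a minimal sketch with the google-cloud-logging Python client, assuming the library is installed and that transfer_run_id is replaced with your actual run id:
from google.cloud import logging

client = logging.Client()
# Same filter as above, expressed as a single string.
log_filter = 'resource.type="bigquery_dts_config" labels.run_id="transfer_run_id"'
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)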
I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and I'm facing the error below. The data files were copied via the hadoop distcp command from an on-prem cluster's Hive instance (version 1.2). Most of the ORC files load successfully, but a few do not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file, and I believe there shouldn't be.
Googling doesn't turn up this error message, but I've found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers of creating a symlink to /usr/share/zoneinfo/GMT-00:00, but I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file into JSON format via orc-tools, I'm able to load that JSON file into BigQuery. So I suspect that the problem is not in the data itself.
Has anybody come across such a problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and the suggestion was to change them in the data. Our workaround was to convert the ORC data to Parquet and then load it into the table (see the sketch after the support response).
Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you were able to get the “ GMT-00:00” string (including the preceding space) replaced with just “-00:00”, that would be a correct time zone representation. Can you change the configuration of the system that generated the file so that it writes a proper time zone string?
Creating a symlink only masks the problem and does not solve it properly, and in the case of BigQuery it is not possible anyway.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
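For reference, a minimal PySpark sketch of the ORC-to-Parquet workaround mentioned above; the paths and Spark setup are hypothetical, and any environment that can read the ORC files should work:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-to-parquet").getOrCreate()
# Read the problematic ORC file and rewrite it as Parquet.
df = spark.read.orc("hive/part-v006-o000-r-00000_a_17")
df.write.mode("overwrite").parquet("hive_parquet/")
The resulting Parquet files can then be loaded with bq load --source_format PARQUET in the same way as the ORC files were.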
I have five JSON files in one folder in Amazon S3. I am trying to load all five files from S3 into Redshift using the COPY command. I am getting an error while loading one of the files. Is there any way in Redshift to skip that file and load the next one?
Use the MAXERROR parameter in the COPY command to increase the number of errors permitted. This will skip over any lines that produce errors.
Then, use the STL_LOAD_ERRORS table to view the errors and diagnose the data problem.
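A minimal sketch of what that looks like, assuming a psycopg2 connection and hypothetical table, bucket, and role names:
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="***")
cur = conn.cursor()
# Allow up to 100 bad records before the COPY aborts.
cur.execute("""
    COPY my_table
    FROM 's3://my-bucket/json-folder/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS JSON 'auto'
    MAXERROR 100;
""")
conn.commit()
# Inspect which rows were skipped and why.
cur.execute("SELECT filename, line_number, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 20;")
for row in cur.fetchall():
    print(row)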
After saving Avro files with snappy compression in S3 using AWS Glue (the same error occurs with gzip/bzip2 compression), when I try to read the data in Athena via a table created by an AWS Glue crawler, I get the following error - HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split - using org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat: Not a data file. Any idea why I get this error and how to resolve it?
Thank you.
I circumvented this issue by attaching the native Spark Avro jar file to the Glue job during execution and using native Spark read/write methods to write the data in Avro format. For the compression, I set spark.conf.set("spark.sql.avro.compression.codec","snappy") as soon as the Spark session is created.
This works perfectly for me, and the output can be read via Athena as well.
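A minimal PySpark sketch of that approach, assuming the Spark Avro package is available to the job (e.g. attached as an extra jar) and using hypothetical S3 paths:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glue-avro-workaround").getOrCreate()
# Set the Avro codec right after the session is created, as described above.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")

df = spark.read.json("s3://my-bucket/input/")  # or whatever the source data is
df.write.mode("overwrite").format("avro").save("s3://my-bucket/avro-output/")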
AWS Glue doesn't support writing compressed Avro files, even though this isn't stated clearly in the docs. The job succeeds, but it applies the compression in the wrong way: instead of compressing the file blocks, it compresses the entire file, which is wrong and is the reason why Athena can't query it.
There are plans to fix the issue, but I don't know the ETA.
It would be nice if you could contact AWS support to let them know that you are having this issue too (the more customers affected, the sooner it gets fixed).
I'm trying to run a training job on AWS Sagemaker, but it keeps failing giving the following error:
ClientError: Unable to parse csv: rows 1-5000, file /opt/ml/input/data/train/KMeans_data.csv
I've selected 'text/csv' as the content type and my CSV file contains 5 columns with numerical content and text headers.
Can anyone point out what could be going wrong here?
Thanks!
According to https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html, the CSV must not have headers:
Amazon SageMaker requires that a CSV file doesn't have a header record ...
Try removing the header row.
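A quick way to strip the header row, assuming pandas is available and using a hypothetical file name:
import pandas as pd

# read_csv consumes the header row; writing with header=False drops it from the output.
df = pd.read_csv("KMeans_data.csv")
df.to_csv("KMeans_data_noheader.csv", header=False, index=False)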
Try to make sure that there are no files other than the training file in the training folder of the S3 bucket.