I'm trying to export 1 TB of Hive data using hive -e (we don't have access to the HDFS file system) and load it into Redshift. The export produced 30,000+ small Parquet files that add up to 1 TB. Loading the data into Redshift throws an error:
String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: e9 (error 2)
Options tried:
ACCEPTINVCHARS -- not available for the Parquet format
Loading via Athena -> Glue crawler -> Redshift. This is not a straightforward solution, as we would have to do the same for 40+ tables in Hive.
How can I build a pipeline that copies the data from Hive and loads it into Redshift? The S3 load can also be skipped.
I've held off on answering since I'm not a Hive expert. The issue is the character encoding of the files. Redshift expects multi-byte UTF-8 (like most of the internet) and these files are encoded differently (likely UTF-16 coming from Windows, but that is just a guess; the e9 byte in the error would also fit a single-byte encoding such as Latin-1, where it is 'é'). I believe that Hive can operate on both kinds of character sets (by configuring the SerDe, but again I'm not a Hive expert). What I don't know is whether Hive can read in one encoding and export in another.
When I have used Hive, it has preserved the input encoding in the output. So one option would be to change the file encoding to UTF-8 in the source system feeding Hive. In the past I've done this from MySQL: export from MySQL in UTF-8 and feed through Hive to Redshift. This is the easiest approach, as it is just configuring a step that already exists.
Another approach is to convert the files from one encoding to the other. The Linux command iconv does this, or you could write some code for a Lambda. This step could be inserted before or after Hive. You will need to know the current file encoding, which may be indicated by a BOM at the start of the file. You can check this with the Linux command 'file'.
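As a rough sketch of that conversion step, the snippet below does in Python what iconv does on the command line. It assumes the export is a text format (for example CSV) and that the source encoding is already known; the function names and the Latin-1 default are illustrative assumptions, not something the question confirms. It does not apply to binary Parquet files, where transcoding the raw bytes would corrupt the file.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    # decode() is strict by default and raises UnicodeDecodeError on bytes
    # that are invalid for the given encoding. Note that Latin-1 accepts
    # every byte value, so a wrong guess there goes unnoticed.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


def transcode_file(src_path: str, dst_path: str,
                   source_encoding: str = "latin-1") -> None:
    """Read a whole text file as bytes and write a UTF-8 copy of it."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(transcode_to_utf8(data, source_encoding))
```

For files too large to hold in memory, the same idea would need to be applied in chunks with an incremental decoder.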
As I said above if Hive can do the conversion that would be great. I just don't know if it does this.
Bottom line: the issue is the encoding of the files that Hive is running on. They need to be changed to UTF-8 for Redshift. This can be done at the source system, with a conversion tool, or possibly in Hive.
If you want to know a lot more on the subject see: https://github.com/boostcon/cppnow_presentations_2014/blob/master/files/unicode-cpp.pdf
Related
I am trying to run a bq job that exports to GCS files. How can I make it write them with UTF-8 encoding?
The part I am using for the extract is:
new JobConfigurationExtract()
.setSourceTable(sourceTableReference)
.setDestinationFormat("CSV")
.setDestinationUris(ImmutableList.of(destinationUri))
How can I configure this to use UTF-8 encoding?
When writing files to S3 from a Glue job, how can I give them a custom file name that includes a timestamp (for example, file-name_yyyy-mm-dd_hh-mm-ss)?
By default, Glue writes the output files with names like part-0**.
Since Glue uses Spark in the background, it is not possible to change the file names directly.
It is possible to rename the files after they have been written to S3, though. This answer provides a simple code snippet that should work.
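A minimal sketch of such a post-write rename, assuming a boto3 S3 client is passed in. The helper names, the prefix/extension arguments, and the key layout are hypothetical; only copy_object and delete_object are real client calls. S3 has no rename operation, so the object is copied to the new key and the original is deleted.

```python
from datetime import datetime


def timestamped_key(prefix: str, base_name: str, extension: str,
                    now: datetime) -> str:
    """Build a key like <prefix><base_name>_yyyy-mm-dd_hh-mm-ss<extension>."""
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}{base_name}_{stamp}{extension}"


def rename_s3_object(s3_client, bucket: str, old_key: str, new_key: str) -> None:
    """Copy the object to new_key, then delete the original key."""
    s3_client.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": old_key},
        Key=new_key,
    )
    s3_client.delete_object(Bucket=bucket, Key=old_key)
```

In a Glue job this would be called once per part file after the write finishes, for example with `boto3.client("s3")` as the client and `datetime.utcnow()` as the timestamp.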
I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and am getting the error below. The data files were copied with the hadoop distcp command from an on-prem cluster's Hive instance, version 1.2. Most of the ORC files load successfully, but a few do not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file, and I believe there shouldn't be.
A Google search turns up nothing for this error message, but I found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. The workaround there, for on-prem servers, is to create a symlink at /usr/share/zoneinfo/GMT-00:00. But I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file into JSON format with orc-tools, I can load that JSON file into BigQuery. So I suspect that the problem is not in the data itself.
Has anybody come across this problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and we were advised to change them in the data. Our workaround was to convert the ORC data to Parquet and then load it into the table.
Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you were able to get the “GMT-00:00” string with preceding space replaced with just “-00:00” that would be the correct name of the time zone. Can you change the configuration of the system which generated the file into having a proper time zone string?
Creating a symlink is only masking the problem and not solving it properly. In case of BigQuery it is not possible.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
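To illustrate the string rewrite the support reply describes (dropping the space and the "GMT" prefix so that only the numeric offset remains), here is a small sketch. The function name and regular expression are illustrative assumptions; this only rewrites timestamp strings and does not touch the time zone stored in ORC metadata.

```python
import re

# Matches a trailing " GMT+hh:mm" or " GMT-hh:mm" and captures the offset.
_GMT_OFFSET = re.compile(r"\s+GMT([+-]\d{2}:\d{2})$")


def normalize_gmt_offset(timestamp: str) -> str:
    """Turn '... GMT-00:00' into '...-00:00'; leave other strings unchanged."""
    return _GMT_OFFSET.sub(r"\1", timestamp)
```

Strings that do not end in that pattern, such as the "abcdef" example above, pass through unchanged and would still be rejected.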
I'm unable to get either the default crawler classifier or a custom classifier to work against many of my CSV files. The classification is listed as 'UNKNOWN'. I've tried re-running existing classifiers, as well as creating new ones. Is anyone aware of a specific configuration for a custom classifier for CSV files that works for files of any size?
I'm also unable to find any errors specific to this issue in the logs.
Although I have seen reference to issues for JSON files over 1MB in size, I can't find anything detailing this same issue for CSV files, nor a solution to the problem.
AWS crawler could not classify the file type stores in S3 if its size >1MB
AWS Glue Crawler Classifies json file as UNKNOWN
Default CSV classifiers supported by Glue Crawler:
CSV - Checks for the following delimiters: comma (,), pipe (|), tab
(\t), semicolon (;), and Ctrl-A (\u0001). Ctrl-A is the Unicode
control character for Start Of Heading.
If your files use any other delimiter, the default CSV classifier will not work. In that case you will have to write a grok pattern.
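As a rough local pre-check, the sketch below tests whether a sample of a file uses one of the delimiters listed above consistently on every line. It is not Glue's actual classifier logic, it ignores quoting, and the function and constant names are made up for illustration.

```python
# The delimiters the built-in CSV classifier is documented to check for.
BUILTIN_DELIMITERS = [",", "|", "\t", ";", "\u0001"]


def detect_builtin_delimiter(sample: str):
    """Return the first listed delimiter that appears the same non-zero
    number of times on every non-empty line, or None if there is none."""
    lines = [line for line in sample.splitlines() if line]
    if len(lines) < 2:
        return None  # too little data to judge consistency
    for delimiter in BUILTIN_DELIMITERS:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1 and 0 not in counts:
            return delimiter
    return None
```

A result of None for the first few lines of a file suggests the delimiter is not one of the built-in ones, or that the rows are irregular.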
After saving Avro files with snappy compression to S3 using AWS Glue (the same error occurs with gzip/bzip2 compression), I crawl the data with an AWS Glue crawler and try to read it in Athena, and I get the following error: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split - using org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat: Not a data file. Any idea why I get this error and how to resolve it?
Thank you.
I circumvented this issue by attaching the native Spark Avro jar to the Glue job at execution time and using the native Spark read/write methods to write the data in Avro format. For compression, I set spark.conf.set("spark.sql.avro.compression.codec","snappy") as soon as the Spark session is created.
This works perfectly for me, and the files can be read via Athena as well.
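A minimal sketch of that workaround, written as a helper that takes an existing Spark session and DataFrame. It assumes the spark-avro jar is already attached to the job and that it registers the "avro" format name (older Databricks builds of the package used a different format name); the helper name, the overwrite mode, and the output path are illustrative.

```python
def write_avro_snappy(spark, df, output_path: str) -> None:
    """Write df as snappy-compressed Avro using Spark's own writer."""
    # Codec setting taken from the answer above; set it before writing.
    spark.conf.set("spark.sql.avro.compression.codec", "snappy")
    # Requires the spark-avro package on the classpath (assumption).
    df.write.format("avro").mode("overwrite").save(output_path)
```

In a Glue job, `spark` would typically be `glueContext.spark_session`, and a DynamicFrame would first be converted with `toDF()`.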
AWS Glue doesn't support writing compressed Avro files, even though this is not stated clearly in the docs. The job succeeds, but it applies the compression in the wrong way: instead of compressing the file blocks, it compresses the entire file. That is why Athena can't query it.
There are plans to fix the issue, but I don't know the ETA.
It would be nice if you could contact AWS support to let them know that you are having this issue too (the more customers affected, the sooner it gets fixed).
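One way to check for this on a downloaded output file is to look at its first bytes: a valid Avro object container file starts with the magic bytes "Obj" followed by 0x01, while a file that was compressed as a whole starts with the compressor's own signature. The sketch below covers gzip and bzip2 only; the function names are illustrative, and a file compressed whole with snappy would simply show up as "unknown" here.

```python
AVRO_MAGIC = b"Obj\x01"   # Avro object container file header
GZIP_MAGIC = b"\x1f\x8b"  # gzip stream signature
BZIP2_MAGIC = b"BZh"      # bzip2 stream signature


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes."""
    if header.startswith(AVRO_MAGIC):
        return "avro"
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a local file and classify them."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(4))
```

A result other than "avro" for a file with an .avro name is consistent with the "Not a data file" error, since the reader checks for the Avro header first.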