How to load .zip files into BigQuery? - google-cloud-platform

We can load uncompressed CSV files and gzipped files completely fine.
However, if we want to load CSV files compressed as .zip, what is the best approach to move ahead?
Will we need to manually convert the zip to gz, or has BigQuery added some support to handle this?
Thanks

BigQuery supports loading gzip files.
The limitation is that with gzip compression BigQuery cannot read the data in parallel, so loading compressed CSV data into BigQuery is slower than loading uncompressed data.
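Since only gzip is supported, one workaround is to recompress the zip member as gzip and then run a normal load job. A minimal sketch using the Python clients, where the archive, bucket, dataset, and table names are all hypothetical:

```python
# A minimal sketch, assuming a local archive data.zip that contains data.csv,
# plus hypothetical bucket ("my-bucket"), dataset, and table names.
import gzip
import shutil
import zipfile

from google.cloud import bigquery, storage

# 1. Recompress the .zip member as .gz, since BigQuery reads gzip but not zip.
with zipfile.ZipFile("data.zip") as zf:
    with zf.open("data.csv") as src, gzip.open("data.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

# 2. Stage the gzipped CSV in Cloud Storage.
storage.Client().bucket("my-bucket").blob("data.csv.gz").upload_from_filename("data.csv.gz")

# 3. Load it into BigQuery; gzip compression is detected automatically for CSV loads.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
job = client.load_table_from_uri(
    "gs://my-bucket/data.csv.gz", "my_dataset.my_table", job_config=job_config
)
job.result()  # wait for the load job to finish
```

Schema autodetect is used here only for brevity; passing an explicit schema works the same way.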

You can try 42Layers.io for this. We use it to import zipped CSV files directly from FTP into BQ, and then set it on a schedule to do it every day. They also let you do field mapping to your existing tables within BQ. Pretty neat.

Related

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket sharing the same prefix name (filename*.csv) and the same schema.
There are some CSVs, however, that make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the one responsible.
This CSV file doesn't have 10 lines...
I found this ticket, BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so that can't be the reason.
On the other hand, this CSV is the only one with content type text/csv; charset=utf-8; all the others are text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
Then, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can then sink that data into BQ.
Dataprep is a GUI-based ETL tool and it runs a Dataflow job internally.
Do let me know if any more clarification is required.
Just to remark on the issue: the CSV file had gzip set as its content encoding, which was the reason BigQuery didn't interpret it as a CSV file.
According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
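If you would rather fix it programmatically, the same metadata can be inspected and patched with the Cloud Storage Python client. A minimal sketch, assuming hypothetical bucket and object names:

```python
# A minimal sketch, assuming a hypothetical bucket "my-bucket" and object
# "filename1.csv"; it clears the stray Content-Encoding and resets the
# Content-Type so the object is served as plain CSV again.
from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("filename1.csv")
print(blob.content_type, blob.content_encoding)  # inspect the current metadata

blob.content_encoding = None   # drop the "gzip" encoding flag
blob.content_type = "text/csv"
blob.patch()                   # push the metadata change to GCS
```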

BigQuery table: loading .7z file from cloud platform

I am trying to upload a compressed file from my GCS bucket into BigQuery.
In the new UI it is not clear how I should specify that the file needs to be uncompressed.
I get an error, as if gs://bucket/folder/file.7z were a .csv file.
Any help?
Unfortunately, .7z files are not supported by BigQuery, only gzip files, and the decompression is done automatically after selecting the data format and creating the table.
If you think BigQuery should accept 7z files too, you could file a feature request so the BigQuery engineers have it in mind for future releases.
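Until then, a possible workaround is to decompress the .7z yourself and stage the contents as gzip, which BigQuery does accept. A rough sketch, assuming the third-party py7zr package and the hypothetical gs://bucket/folder/file.7z names from the question:

```python
# A rough sketch of the workaround; assumes the third-party py7zr package
# and the hypothetical gs://bucket/folder/file.7z names from the question.
import glob
import gzip
import os
import shutil

import py7zr
from google.cloud import storage

# 1. Extract the archive locally (BigQuery cannot read .7z directly).
with py7zr.SevenZipFile("file.7z", mode="r") as archive:
    archive.extractall(path="extracted")

# 2. Gzip each extracted CSV and stage it back in Cloud Storage.
bucket = storage.Client().bucket("bucket")
for path in glob.glob("extracted/*.csv"):
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    bucket.blob("folder/" + os.path.basename(path) + ".gz").upload_from_filename(path + ".gz")

# The resulting gs://bucket/folder/*.csv.gz objects can be loaded as gzipped CSV.
```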

Athena reading from AWS DMS CSV files

I've configured my DMS task to read from a MySQL database and migrate its data to S3 with replication. Everything seems to work fine: it creates big CSV files for all the data and then starts to create smaller CSV files with the deltas.
The problem is that when I read these CSV files with AWS Glue crawlers, they don't seem to pick up the deltas, or even worse, they seem to pick up only the deltas, ignoring the big CSV files.
I know that there is a similar post here: Athena can't resolve CSV files from AWS DMS
But it is unanswered and I can't comment there, so I'm opening this one.
Has anyone found a solution to this?
Best regards.

Process a compressed gz file to create table schema using Glue Data crawler

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the S3 bucket daily by the client. When uncompressed, the gzip contains 10 files in CSV format, all with the same schema. My objective is to process the gzip file, use a Glue crawler to create the table schema, and then load / merge all the data into a new single table as a Parquet file.
Can a Glue crawler read a gz file and create tables for the list of files? Please help with a solution.
Thanks.
Yes, it can read gzipped and zipped CSVs:
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html#classifier-built-in
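For example, a crawler can simply be pointed at the S3 prefix where the gzipped CSVs land; the built-in CSV classifier handles the compression. A minimal boto3 sketch, with hypothetical names for the crawler, role, database, and bucket:

```python
# A minimal boto3 sketch; crawler, role, database, and bucket names are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="daily-gz-csv-crawler",
    Role="GlueServiceRole",                 # IAM role with S3 and Glue permissions
    DatabaseName="raw_csv_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/daily-drop/"}]},
)
glue.start_crawler(Name="daily-gz-csv-crawler")

# A Glue ETL job (or Athena CTAS) can then read the crawled table and
# write the merged output back out as Parquet.
```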

Re-parsing Blob data stored in HDFS imported from Oracle by Sqoop

Using Sqoop I've successfully imported a few rows from a table that has a BLOB column. Now the part-m-00000 file contains all the records, along with the BLOB field, as CSV.
Questions:
1) As per the docs, knowledge of the Sqoop-specific format can help to read those BLOB records.
So, what does the Sqoop-specific format mean?
2) Basically the BLOB is a .gz file of a text file containing some float data. These .gz files are stored in the Oracle DB as BLOBs and imported into HDFS using Sqoop. So how can I get that float data back from the HDFS file?
Any sample code would be of very great use.
I see these options.
Sqoop import from Oracle directly into a Hive table with a binary data type. This option may limit the processing capabilities outside Hive, like MR, Pig, etc.; i.e. you may need to know how the BLOB gets stored in Hive as binary, etc. This is the same limitation that you described in your question 1.
Sqoop import from Oracle to Avro, SequenceFile, or ORC file formats, which can hold binary data. You should be able to read this by creating a Hive external table on top of it, and you can write a Hive UDF to decompress the binary data (see the sketch below). This option is more flexible, as the data can also be processed easily with MR, especially with the Avro and SequenceFile formats.
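As an illustration of the decompression step in option 2, here is a minimal sketch in plain Python (the same logic could live inside a Hive UDF or a Spark job): given the raw BLOB bytes, which are a gzipped text file of floats, it recovers the float values. The function and file names are hypothetical.

```python
# A minimal sketch (plain Python for illustration; the same logic could sit
# inside a Hive UDF or Spark job). The BLOB bytes are a gzipped text file of
# floats; the function and file names here are hypothetical.
import gzip

def blob_to_floats(blob_bytes):
    """Decompress a gzipped BLOB payload and parse whitespace-separated floats."""
    text = gzip.decompress(blob_bytes).decode("utf-8")
    return [float(token) for token in text.split()]

# Example: turn one record's BLOB column (read back as raw bytes) into numbers.
with open("blob_payload.gz", "rb") as f:
    print(blob_to_floats(f.read()))
```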
Hope this helps. How did you resolve it?