Transfer CSV File from GCS to BQ using DataFlow Template - google-cloud-platform

I have a list of CSV files in a GCS bucket and I want to load those files into BQ.
Before loading, all columns need to be converted to STRING type.
I am using the Dataflow template Text Files on Cloud Storage to BigQuery,
where a JavaScript user-defined function (UDF) needs to be provided along with a JSON file defining the BigQuery table schema.
In the JSON schema, each column needs to be declared as STRING. (It's a tedious task, as each CSV has 50+ columns and I have to manually write every column name in both the UDF and the JSON.)
Is there any approach to do this?
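One way to avoid writing the 50+ column names by hand is to generate both artifacts from each CSV's header row. Below is a minimal Python sketch (not part of the template itself) that emits an all-STRING schema JSON and a matching JavaScript UDF; the file names are placeholders, and the "BigQuery Schema" wrapper key is my assumption of what the template expects, so verify it against the template docs.

    # Minimal sketch: generate an all-STRING schema JSON and a JavaScript UDF
    # from a CSV header row. File names below are placeholders.
    import csv
    import json

    def generate_artifacts(csv_path):
        with open(csv_path, newline="") as f:
            columns = [c.strip() for c in next(csv.reader(f))]

        # JSON schema: every column declared as STRING.
        schema = {"BigQuery Schema": [{"name": c, "type": "STRING"} for c in columns]}
        with open("schema.json", "w") as f:
            json.dump(schema, f, indent=2)

        # JavaScript UDF: split the CSV line and map each value to its column name.
        # (Naive split on "," -- quoted fields with embedded commas would need more care.)
        assignments = "\n".join(
            '  obj["{}"] = values[{}];'.format(c, i) for i, c in enumerate(columns))
        udf = (
            "function transform(line) {\n"
            "  var values = line.split(',');\n"
            "  var obj = {};\n"
            + assignments + "\n"
            "  return JSON.stringify(obj);\n"
            "}\n")
        with open("udf.js", "w") as f:
            f.write(udf)

    generate_artifacts("sample.csv")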

Related

Spanner to CSV DataFlow

I am trying to copy a table from Spanner to BigQuery. I have created two Dataflow jobs: one that copies from Spanner to a text file, and another that imports the text file into BigQuery.
The table has a column that contains a JSON string as its value. The issue appears when the Dataflow job that imports the text file into BigQuery runs. The job throws the error below:
INVALD JSON: :1:38 Expected eof but found, "602...
Is there any way I can exclude this column while copying, or any way to copy the JSON object as-is? I tried excluding this column in the schema file but it did not help.
Thank you!
Looking at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-spanner-to-cloud-storage-text, there are no options on BigQuery import jobs that would allow you to skip columns, nor Cloud Spanner options that would skip a column when extracting.
I think your best shot is to write a custom processor that will drop the column, similar to Cleaning data in CSV files using dataflow.
It's more complicated, but you can also try Dataprep: https://cloud.google.com/dataprep/docs/html/Drop-Transform_57344635. It should be possible to run Dataprep jobs as a Dataflow template.
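If you go with the custom-processor route, a small Apache Beam pipeline can drop the JSON column before the BigQuery import. A rough Python sketch, where the column index and GCS paths are placeholders:

    # Rough sketch: drop one column from CSV lines before importing into BigQuery.
    # DROP_INDEX and the GCS paths are assumptions to adjust.
    import csv
    import io
    import apache_beam as beam

    DROP_INDEX = 3  # position of the JSON column to exclude

    def drop_column(line):
        fields = next(csv.reader([line]))
        del fields[DROP_INDEX]
        out = io.StringIO()
        csv.writer(out).writerow(fields)
        return out.getvalue().rstrip("\r\n")

    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/spanner-export/*.csv")
         | "DropColumn" >> beam.Map(drop_column)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/cleaned/part"))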

BigQuery table update from reference CSV file

I have a BigQuery table and I want to update the content of a few rows in it from a reference CSV file. This CSV file is uploaded to a Google Cloud Storage bucket.
When you use an external table over Cloud Storage, you can only read the CSVs, not update them.
However, you can load your CSV into a BigQuery native table, perform the update with DML, and then export the table back to CSV. That only works if you have a single CSV file.
If you have several CSV files, you can, at least, print the pseudo-column _FILE_NAME to identify the files where you need to perform the change. But the change will have to be performed manually or with the previous solution (native table).
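For the single-CSV case, here is a hedged sketch of the load-then-DML approach with the google-cloud-bigquery client; the project, dataset, table, and join key names are placeholders:

    # Sketch of the "native table + DML" approach. All names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    # 1. Load the reference CSV from GCS into a staging table.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/reference.csv",
        "my_project.my_dataset.reference_staging",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition="WRITE_TRUNCATE",
        ),
    )
    load_job.result()

    # 2. Update the target table with a DML MERGE against the staging table.
    client.query("""
        MERGE my_project.my_dataset.target AS t
        USING my_project.my_dataset.reference_staging AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET t.value = s.value
    """).result()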

AWS CSV data pipelining

I am new to AWS and want to do some data pipelining in AWS.
I have a bunch of CSV files stored in S3.
Things I want to achieve:
Union all the CSV files and add the filename to each line; the first line (header) needs to be removed from each file before unioning the CSVs;
Split the filename column by the _ delimiter;
Store all of this in a DB after processing.
What is the best/fastest way to achieve this?
Thanks
You can create a Glue job using PySpark which will read the CSV files into a DataFrame, and then you can transform it however you like.
After that you can convert that DataFrame to Parquet and save it in S3.
Then you can run a Glue crawler which will catalog the Parquet data as a table that you can query.
Basically you are doing ETL with AWS Glue.
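A rough PySpark sketch of that Glue job, assuming placeholder bucket paths and that the filename splits into two parts on "_":

    # Rough sketch of the Glue/PySpark approach: read all CSVs (header rows are
    # consumed by the reader), tag each row with its source filename, split the
    # filename on "_", and write Parquet back to S3. Paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-union").getOrCreate()

    df = (spark.read
          .option("header", "true")       # drops the first (header) line of each file
          .csv("s3://my-bucket/input/*.csv")
          .withColumn("filename", F.input_file_name()))

    # Keep just the object name, then split it on the "_" delimiter.
    name = F.element_at(F.split(F.col("filename"), "/"), -1)
    parts = F.split(name, "_")
    df = (df.withColumn("file_part_1", parts.getItem(0))
            .withColumn("file_part_2", parts.getItem(1)))

    df.write.mode("overwrite").parquet("s3://my-bucket/output/")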

BigQuery can't query some csvs in Cloud Storage bucket

I created a permanent BigQuery table that reads some csv files from a Cloud Storage bucket sharing the same prefix name (filename*.csv) and the same schema.
There are some csvs anyway that make BigQuery queries fail with a message like the following one: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the csvs out of the bucket one by one, I identified the one responsible for that.
This csv file doesn't have 10 lines...
I found this ticket, BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was having an empty line at the end. But other csvs in my bucket have one too, so this can't be the reason.
On the other hand, this csv is the only one with content type text/csv; charset=utf-8, all the others being text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this csv to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
Then, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying csvs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your csv file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can then sink that data into BQ.
Dataprep is a GUI-based ETL tool and it runs a Dataflow job internally.
Do let me know if any more clarification is required.
Just to remark on the issue: the CSV file had gzip as its content encoding, which was the reason BigQuery didn't interpret it as a plain CSV file.
According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
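If you prefer to fix it programmatically rather than in the Console, here is a small sketch with the google-cloud-storage client (bucket and object names are placeholders):

    # Sketch: inspect and clear the contentEncoding metadata on the problematic object.
    # Bucket and object names are placeholders.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")
    blob = bucket.get_blob("filename1.csv")

    print(blob.content_type, blob.content_encoding)  # e.g. "text/csv; charset=utf-8", "gzip"

    # Remove the gzip content encoding so BigQuery reads the object as plain CSV.
    blob.content_encoding = None
    blob.content_type = "text/csv"
    blob.patch()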

Is it possible to create an AWS Glue classifier which can convert a csv file to pipe-delimited

I would like to convert a monthly feed from csv to pipe-delimited using an AWS Glue crawler. Is it possible to create a classifier which can convert the csv file to pipe-delimited (using Grok or something) so that a monthly scheduled crawler can create the Glue catalog?
A Glue crawler is used for populating the AWS Glue Data Catalog with tables, so you cannot convert your file from csv format to pipe-delimited using only this functionality. The right steps should look like this:
Create two tables in the Glue Data Catalog: one for the file in CSV format, and one for the pipe-delimited format. To catalog the source table, you can use a Glue crawler.
Create a Glue job to transfer data between these tables.
This article does not refer exactly to your problem, but you can see how these mentioned steps should look:
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
There are also tutorials in the Glue console (at the bottom of the left menu).
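For step 2, a minimal PySpark sketch of a Glue job that reads the cataloged CSV table and rewrites it pipe-delimited; the database, table, and output path are placeholders:

    # Sketch of a Glue job: read the cataloged CSV table and write it back
    # to S3 with "|" as the separator. Names and paths are placeholders.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database", table_name="monthly_feed_csv")

    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/pipe-delimited/"},
        format="csv",
        format_options={"separator": "|"},
    )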