Spanner to CSV DataFlow - google-cloud-platform

I am trying to copy a table from Spanner to BigQuery. I have created two Dataflow jobs: one copies from Spanner to a text file, and the other imports the text file into BigQuery.
The table has a column whose value is a JSON string. The issue appears when the Dataflow job imports the text file into BigQuery. The job throws the error below:
INVALD JSON: :1:38 Expected eof but found, "602...
Is there any way I can exclude this column while copying, or any way I can copy the JSON object as it is? I tried excluding this column in the schema file, but it did not help.
Thank you!

Looking at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-spanner-to-cloud-storage-text, there are neither options on BigQuery import jobs that would let you skip columns, nor Cloud Spanner options that would skip a column when extracting.
I think your best shot is to write a custom processor that will drop the column, similar to Cleaning data in CSV files using dataflow.
It's more complicated, but you can also try Dataprep: http://cloud/dataprep/docs/html/Drop-Transform_57344635. It should be possible to run Dataprep jobs as a Dataflow template.
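As a rough illustration of the "custom processor" idea, here is a minimal sketch of a function that drops one column from a CSV line. It assumes the exported text file quotes fields that contain commas (such as the JSON string); the function name drop_column and the sample row are made up for the example.

```python
import csv
import io


def drop_column(line: str, index: int) -> str:
    """Return the CSV line with the field at position `index` removed."""
    # csv.reader respects quoted fields, so commas inside a quoted JSON
    # string do not split the value into several columns.
    fields = next(csv.reader([line]))
    del fields[index]
    out = io.StringIO()
    csv.writer(out).writerow(fields)
    # The writer appends a line terminator; strip it to return a bare line.
    return out.getvalue().rstrip("\r\n")


if __name__ == "__main__":
    # Hypothetical exported row: id, JSON payload, date.
    row = '42,"{""id"": 602, ""tags"": [""a"", ""b""]}",2020-01-01'
    print(drop_column(row, 1))
```

In an Apache Beam pipeline, a function like this could probably be applied with beam.Map(drop_column, index=1) between the text read and write steps, but that wiring is an assumption and is not shown here.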

Related

How does Amazon Athena manage rename of columns?

Hello everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries on Parquet files stored in S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column is renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script that iterates through all my files on S3, renames the columns and updates the table schema on Athena from 'col_b' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
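To make the difference concrete, here is a small pure-Python simulation of the two mapping strategies applied to the Day 1 / Day 2 example from the question. It is only a model of the behaviour, not Athena code; the helper names read_by_name and read_by_index and the sample values are invented for illustration.

```python
def read_by_name(file_columns, row, table_columns):
    """Map a row to the table schema by matching column names."""
    values = dict(zip(file_columns, row))
    # A table column that is absent from the file comes back as None,
    # which stands in for NULL here.
    return {col: values.get(col) for col in table_columns}


def read_by_index(row, table_columns):
    """Map a row to the table schema purely by position."""
    # Pad short rows so columns appended later read as None.
    padded = list(row) + [None] * (len(table_columns) - len(row))
    return dict(zip(table_columns, padded))


# Table schema after the rename on Day 2.
table = ["col_a", "col_beta", "col_c"]
# A Day 1 file still uses the old name col_b.
day1_cols, day1_row = ["col_a", "col_b", "col_c"], (1, "x", True)

print(read_by_name(day1_cols, day1_row, table))
print(read_by_index(day1_row, table))
```

In this model, reading by name leaves col_beta empty for Day 1 rows, while reading by index picks up the old col_b value because it sits in the same position.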
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
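A sketch of what such a view could look like, generated from Python for convenience. It assumes the underlying table declares both the old and the new column names (col_b and col_beta); the table name events_raw, the view name events_v and the helper build_view_sql are hypothetical.

```python
def build_view_sql(view, table, renamed):
    """Build a CREATE VIEW statement that merges renamed columns.

    `renamed` maps each current column name to the list of older names
    it replaced; columns that were never renamed map to an empty list.
    """
    select_items = []
    for current, old_names in renamed.items():
        if old_names:
            # COALESCE returns the first non-NULL value, so old files
            # (old name filled) and new files (new name filled) merge.
            candidates = ", ".join([current, *old_names])
            select_items.append(f"COALESCE({candidates}) AS {current}")
        else:
            select_items.append(current)
    columns = ",\n  ".join(select_items)
    return f"CREATE OR REPLACE VIEW {view} AS\nSELECT\n  {columns}\nFROM {table}"


if __name__ == "__main__":
    sql = build_view_sql(
        "events_v",
        "events_raw",
        {"col_a": [], "col_beta": ["col_b"], "col_c": []},
    )
    print(sql)
```

Queries would then go against the view, so SELECT col_beta returns values from both the Day 1 and Day 2 files.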
You can set the AWS Glue crawler schedule to 'On Demand' or 'Time Based', so that a new schema is generated each time the crawler runs against your updated S3 data (you can also edit the data types of the attributes in the schema). This way your columns stay up to date and you can query the new field.
Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names to map data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

AWS Glue Crawler is not updating the table after the 1st crawl

I am adding a new file in Parquet format, created by Glue DataBrew, to my S3 folder. The new file has the same schema as the previous file. But when I run the crawler a second time, it neither updates the table nor creates a new one in the Data Catalog. However, when I crawl both files together, both of them get added.
Log File is giving the following information:
INFO : Created partitions with values [[New file name]] for table
BENCHMARK : Finished writing to Catalog
I have tried with and without "Create a single schema for each S3 path", but the crawler is not updating the table with the new file. Soon I will be adding new files on a daily basis for my analysis. Any solution?
In my opinion, the best way to approach this issue is to have AWS DataBrew write its output to the Data Catalog directly. The Data Catalog can be updated either by the crawler or by DataBrew directly, but the recommended practice is to use only one of those mechanisms, not both.
Can you try running the job with the Data Catalog as the output and let DataBrew manage your catalog? It should update your catalog table with the right data/files.

Dataprep - Append data to BigQuery table

I'm using GCP's Dataprep to join several CSV files with the same column structure, clean some of the data and then write it to a BigQuery database.
I have to record this data in BigQuery. Can I take this data from Dataprep and append it to a BigQuery table?
Yes, you can take your data from Dataprep and append it to a BigQuery table.
Before running the job, in the "Run Job on DataFlow" section:
Click on the output action; since you are using BigQuery for the output, it should look like "Create-BigQuery"
In the next window, choose your output table
In the left panel, select "Append to this table every run"
Click on Update
Now, when you run the job, it will append your data.
The following documentation can be useful.
Yes, it is possible to either truncate or append data to a BigQuery table. In the output step in Dataprep, when selecting the BigQuery table, you can set the data to be appended to the table.
My difficulty is in the "Connect your data" step. I have plenty of tables with the prefix events_ and I want them all. My intuition is to parameterize this as events_*, but I do not have this option in BigQuery table ingestion.

How can I vlookup two files with the help of Google Bigquery and Google Storage?

I need your support to vlookup a column from a file to my BigQuery spreadsheet.
Currently I have a project.dataset.spreadsheet in Google BigQuery with several columns including one with the field name "query".
And I have another spreadsheet as an xlsx file inside my Google Cloud Storage that has the columns "query" and "group".
I would like to add the matching "group" value from this file to my BigQuery spreadsheet.
Thanks in advance and cheers from Austria!
Nes
The way I would address the problem is to convert the xlsx file to CSV, load it into BigQuery and join it with your existing data in BigQuery.
If you need a new table, perform an insert select into a new table, else a view could be enough.
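The join itself can be sketched as follows. To keep the example runnable locally, it uses Python's built-in sqlite3 as a stand-in for BigQuery; the table names, the clicks column and the sample rows are made up. In BigQuery the reserved word group would be quoted with backticks rather than double quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE spreadsheet (query TEXT, clicks INTEGER);
    CREATE TABLE lookup (query TEXT, "group" TEXT);
    INSERT INTO spreadsheet VALUES ('shoes', 10), ('hats', 3), ('socks', 7);
    INSERT INTO lookup VALUES ('shoes', 'footwear'), ('socks', 'footwear');
    """
)

# LEFT JOIN keeps every spreadsheet row; queries without a match in the
# lookup table get NULL (None in Python) for "group", like a failed VLOOKUP.
rows = conn.execute(
    """
    SELECT s.query, s.clicks, l."group"
    FROM spreadsheet AS s
    LEFT JOIN lookup AS l ON s.query = l.query
    ORDER BY s.query
    """
).fetchall()

for row in rows:
    print(row)
```

Wrapping the same SELECT in a CREATE VIEW or a CREATE TABLE ... AS statement would give the view or the new table mentioned above.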
However, my answer may be incomplete. It depends on whether you need to repeat the process with your xlsx file, and on other details that you might not have described.
Cheers from Vienne (but in France!)

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent Big Query table that reads some csv files from a Cloud Storage Bucket sharing the same prefix name (filename*.csv) and the same schema.
However, some CSVs make BigQuery queries fail with a message like the following one: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the one responsible.
This csv file doesn't have 10 lines...
I found this ticket, BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so this can't be the reason.
On the other hand this csv is the only one with content type text/csv; charset=utf-8, all the others being text/csv,application/vnd.ms-excel,application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
Then even with the missing line Big Query can then query the permanent table based on filename*.csvs.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can then sink that data in BQ.
Dataprep is GUI based ETL tool and it runs a dataflow job internally.
Do let me know if any more clarification is required.
To point out the root cause: the CSV file had gzip as its encoding, which is why BigQuery could not interpret it as a CSV file.
According to documentation BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
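If you want to check a suspicious file locally after downloading it, a minimal sketch along these lines can tell whether the bytes are gzip-compressed and which rows have fewer columns than the schema expects. The function names and the expected column count are assumptions made for the example.

```python
import csv
import gzip
import io

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


def short_rows(data: bytes, expected_columns: int):
    """Yield (line_number, column_count) for rows with too few columns."""
    if is_gzipped(data):
        data = gzip.decompress(data)
    reader = csv.reader(io.StringIO(data.decode("utf-8"), newline=""))
    for number, row in enumerate(reader, start=1):
        # Blank lines parse as empty rows and are skipped here.
        if row and len(row) < expected_columns:
            yield number, len(row)


if __name__ == "__main__":
    sample = b"a,b,c\n1,2\n4,5,6\n"
    print(is_gzipped(sample), list(short_rows(sample, 3)))
```

Reading the file in binary mode and passing the bytes to these helpers keeps the check independent of the content type stored in the GCS metadata.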