I am trying to figure out how to import Google Analytics data into AWS Redshift. So far I have been able to set up an export job so that the data makes it to Google's BigQuery, and then export the tables to Google Cloud Storage.
BigQuery stores data in a particular way, so when you export it to a file, it gives you a multilevel nested JSON structure. To import it into Redshift, I would have to "explode" that JSON into a table or CSV file.
I haven't been able to find a simple solution to do this.
Does anyone know how I can do this in an elegant and efficient way, instead of having to write a long function that will go through the whole JSON object?
Here's Google's documentation about how to export data https://cloud.google.com/bigquery/docs/exporting-data
You can try the following:
Export your BigQuery data as JSON and get it into an S3 bucket
Create a JSONPaths file according to the specification
Include the JSONPaths file in your COPY command to import the data into Redshift
You may also try to export your BigQuery table as Avro (one of the supported export file formats in BigQuery) instead of JSON. This link has an example of how to write the JSONPaths file for nested Avro objects.
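To illustrate the JSONPaths step, here is a minimal Python sketch that derives a JSONPaths file from one sample record. The field names (visitId, totals, device) are placeholders I made up, not the actual export schema. Only nested objects are walked; arrays are kept as leaves, and as far as I know COPY then loads the whole array as a string, so check that against the Redshift documentation before relying on it.

```python
import json


def leaf_paths(record, prefix="$"):
    """Return bracket-notation JSONPath expressions for every non-object leaf."""
    paths = []
    for key, value in record.items():
        path = f"{prefix}['{key}']"
        if isinstance(value, dict):
            # Recurse into nested objects so each leaf gets its own column.
            paths.extend(leaf_paths(value, path))
        else:
            paths.append(path)
    return paths


# Hypothetical sample record; replace with one line of your real export.
sample = {
    "visitId": 1,
    "totals": {"visits": 1, "hits": 4},
    "device": {"browser": "Chrome"},
}

# A JSONPaths file is a JSON object with a single "jsonpaths" array.
jsonpaths_text = json.dumps({"jsonpaths": leaf_paths(sample)}, indent=2)
print(jsonpaths_text)
```

The order of the expressions has to match the column order of the target Redshift table, so you may need to reorder the generated list by hand.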
Just wondering if it is possible to not just import from CSV-based data sets for the Vertex AI feature store, but from parquet or delta file formats as well. When trying to import a dataset from within GCP, the only options it gives are from BigQuery or from CSV.
I have attached a picture of the options given
No Parquet Option - Only CSV and BigQuery
Does anyone know if there is an API/plug-in/other methodology from which one can load parquet or delta files directly into the Vertex AI feature store?
Thank you!
Sorry, but no.
Vertex AI's Feature Store can only ingest from BigQuery tables or Cloud Storage files, the latter of which must be in either CSV or Avro format.
I am not aware of any planned support for the Parquet format. However, you can load Parquet into BigQuery, so that may be your best option.
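As a rough sketch of that workaround, the snippet below builds a bq load command for Parquet files in Cloud Storage. The dataset, table and bucket names are hypothetical, and actually running the command assumes the Google Cloud SDK is installed and authenticated.

```python
import subprocess


def bq_load_parquet_command(table, gcs_uri):
    """Build the `bq load` invocation for Parquet files stored in Cloud Storage."""
    return ["bq", "load", "--source_format=PARQUET", table, gcs_uri]


def run(command):
    """Execute the command; needs the Google Cloud SDK on PATH."""
    subprocess.run(command, check=True)


# Hypothetical names, for illustration only.
command = bq_load_parquet_command(
    "my_dataset.my_features", "gs://my-bucket/features/*.parquet"
)
print(" ".join(command))
```

Once the table exists in BigQuery, it can be used as the import source for the feature store.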
I have been trying to import a sample NoSQL database into GCP Datastore. When the data is stored in GCS, Datastore asks for a file with a specific extension, i.e. .overall_export_metadata.
I don't believe there are any existing tools that can just import a CSV into datastore. You could write a Google Dataflow job to do this.
https://beam.apache.org/documentation/programming-guide/
https://cloud.google.com/dataflow/docs/quickstarts
It does look like they provide a template-based job that takes in a JSON file and writes it to datastore
https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#gcstexttodatastore
JSON format:
https://cloud.google.com/datastore/docs/reference/data/rest/v1/Entity
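In case it helps, here is a minimal sketch of turning CSV rows into JSON objects shaped like the Entity resource linked above, one per line. The kind "User", the key column "id" and the sample data are invented for the example, and the type guessing is deliberately naive. Depending on your setup the key may also need a partitionId, which I have left out.

```python
import csv
import io
import json


def to_value(raw):
    """Map a CSV string onto a Datastore REST Value object (naive type guess)."""
    try:
        # int64 values are encoded as strings in the JSON representation.
        return {"integerValue": str(int(raw))}
    except ValueError:
        pass
    try:
        return {"doubleValue": float(raw)}
    except ValueError:
        return {"stringValue": raw}


def row_to_entity(row, kind, key_column):
    """Build an Entity-shaped dict, using one column as the key name."""
    properties = {
        name: to_value(value) for name, value in row.items() if name != key_column
    }
    return {
        "key": {"path": [{"kind": kind, "name": row[key_column]}]},
        "properties": properties,
    }


# Hypothetical input; in practice this would be your CSV file.
CSV_TEXT = "id,name,age,score\nu1,Ada,36,9.5\n"
reader = csv.DictReader(io.StringIO(CSV_TEXT))
lines = [json.dumps(row_to_entity(row, kind="User", key_column="id")) for row in reader]
print("\n".join(lines))
```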
I want to export the output of my SELECT statement in BigQuery. There is a "Save Results" option available, but it has a limit of 16,000 rows.
How can I export more than that?
If your data has more than 16,000 rows you'd need to save the result of your query as a BigQuery Table.
Afterwards, export the data from the table into Google Cloud Storage using any of the available options (such as the Cloud Console, API, bq or client libraries).
Finally, you can use any of the available methods in Google Cloud Storage (Cloud Console, API, gsutil or client libraries) to download the CSV file within your local environment.
These extra steps are needed because, for the time being, you can't export data from a table directly to a local file.
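The three steps could be scripted roughly like this with the bq and gsutil command-line tools. This is only a sketch: the query, table and bucket names are placeholders, and it assumes the Google Cloud SDK is installed and authenticated.

```python
import subprocess


def export_steps(query, table, gcs_uri, local_dir="."):
    """Return the three commands: query to table, table to GCS, GCS to local."""
    return [
        # 1. Save the query result as a BigQuery table.
        ["bq", "query", "--use_legacy_sql=false", f"--destination_table={table}", query],
        # 2. Export that table to Cloud Storage as CSV.
        ["bq", "extract", "--destination_format=CSV", table, gcs_uri],
        # 3. Download the exported file.
        ["gsutil", "cp", gcs_uri, local_dir],
    ]


def run_all(steps):
    """Run each command in order, stopping at the first failure."""
    for step in steps:
        subprocess.run(step, check=True)


# Hypothetical names, for illustration only.
steps = export_steps(
    "SELECT * FROM my_dataset.my_source",
    "my_dataset.my_result",
    "gs://my-bucket/exports/result.csv",
)
for step in steps:
    print(" ".join(step))
```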
There is an easier way using the EXPORT DATA statement. You can run a simple query that exports directly to GCS:
EXPORT DATA OPTIONS(
uri='gs://MyBucket/path/to/file*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=',') AS
SELECT * FROM MyTable WHERE condition;
However, it is quite likely that several files will be generated, so you have to fetch all of them with gsutil.
The advantage of Daniel's solution is that you can choose to export the table to a single file.
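If you do end up with several files and want a single CSV locally, something like the following could stitch them together after download. It is a sketch that assumes every shard starts with the same header row (header=true in the export) and that the output file name does not need to match the shard pattern.

```python
import csv
import glob
import os


def merge_csv_shards(pattern, output_path):
    """Concatenate CSV shards into one file, keeping only the first header row.

    Returns the number of data rows written.
    """
    # Resolve the shard list before creating the output, and never read the output itself.
    target = os.path.abspath(output_path)
    shards = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != target)

    rows_written = 0
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        for index, path in enumerate(shards):
            with open(path, newline="") as shard:
                reader = csv.reader(shard)
                header = next(reader, None)
                if index == 0 and header is not None:
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, merge_csv_shards("downloads/file*.csv", "merged.csv") would combine everything matching the wildcard from the export URI; the directory and file names here are made up.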
I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?
This answer comes with the assumption that you are not using any tool to create schemas around your unstructured data and query your data, like BigQuery, Hive, Presto. And you simply want to catalog your files.
I had a similar use case. Google Data Catalog has an option to create custom entries.
Some tips on building a Data Catalog on unstructured files data:
Use meaningful file names on your JSON files. That way searching for them will become easier.
Since you are already using GCP, use their managed Data Catalog, and leverage their custom entries API to ingest the files metadata into it.
In case you also want to look for sensitive data in your JSON files, you could run DLP on them.
Use Data Catalog Tags to enrich the files' metadata. The tutorial on the link shows how to do it on BigQuery tables, but you can do the same on custom entries.
I would add some information about the ETL jobs that convert these documents into JSON files as Tags, such as execution time, data quality score, user, business owner, etc.
In case you are wondering how to do step 2, I put together a script that does it automatically, available on GitHub.
Another option is to work with Data Catalog Filesets.
So, to choose between custom entries and filesets, I'd ask you this: do you need information about your file names?
If not, then filesets might be easier. At the time of this writing they do not show any information about your file names, but they are good for managing file patterns in GCS buckets: a fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
The datacatalog-util also has an option to enrich your filesets, in case you just want to have statistics about them, like average file size, types, etc.
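To give an idea of what "technical metadata" could look like before it goes into custom entries or tags, here is a small sketch that collects a few attributes per JSON file. The attribute names are my own choice, not a Data Catalog schema, and the actual ingestion call is left out.

```python
import hashlib
import json
import os
from datetime import datetime, timezone


def technical_metadata(path):
    """Collect basic technical metadata for one JSON file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    document = json.loads(raw)
    modified = datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": len(raw),
        # A checksum makes it easy to spot re-processed or duplicated files.
        "sha256": hashlib.sha256(raw).hexdigest(),
        "modified_utc": modified.isoformat(),
        # Top-level keys give a rough idea of the document's structure.
        "top_level_keys": sorted(document) if isinstance(document, dict) else [],
    }
```

Each returned dictionary could then be mapped onto a custom entry plus a tag template with matching fields.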
I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket. The files share the same prefix (filename*.csv) and the same schema.
However, some of the CSVs make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the one responsible.
This csv file doesn't have 10 lines...
I found this ticket, "BigQuery error when loading csv file from Google Cloud Storage", so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so this can't be the reason.
On the other hand this csv is the only one with content type text/csv; charset=utf-8, all the others being text/csv,application/vnd.ms-excel,application/octet-stream.
Furthermore, when I download this CSV to my local Windows machine and upload it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
After that, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can sink the data into BigQuery.
Dataprep is a GUI-based ETL tool, and it runs a Dataflow job internally.
Do let me know if any more clarification is required.
To clarify the issue: the CSV file had gzip as its encoding, which is why BigQuery did not interpret it as a CSV file.
According to documentation BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
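For anyone hunting a similar file, a quick local check along these lines can help. It is only a sketch: it looks at the downloaded bytes (the gzip magic number), not at the Content-Encoding metadata stored in GCS, and the expected column count is something you pass in yourself.

```python
import csv
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def looks_gzipped(data):
    """Return True if the raw bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def short_rows(data, expected_columns):
    """Return (line_number, column_count) for rows with fewer columns than expected."""
    if looks_gzipped(data):
        data = gzip.decompress(data)
    reader = csv.reader(io.StringIO(data.decode("utf-8"), newline=""))
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) < expected_columns
    ]
```

Running looks_gzipped on the raw bytes of each file in the bucket should single out the compressed one, and short_rows shows whether the decompressed content is actually well formed.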