Best way to load GCP Asset Inventory into Bigtable - google-cloud-platform

Asset inventories are exported to Cloud Storage in JSON. I want to load this data into Bigtable, but I'm not sure what the best approach is.
I'm thinking the pipeline will look something like Cloud Storage > ETL to CSV/sequence files > Load into Bigtable using Dataflow.
What are the options for loading JSON Cloud Storage data into Bigtable?

Since both JSON and Bigtable are so flexible and amorphous, there are no pre-packaged conversions between the two. Google provides all of the pieces, but you have to write some code to glue them together. Specifically, mapping the JSON documents to Bigtable's columns and rows has to be done from scratch.
Use a Cloud Asset client [1] to export the assets to GCS.
Depending on the size of the exports, either
create a Dataflow job that uses TextIO to read the export,
or use a Cloud Storage client [2] directly.
Use your favorite JSON library to parse each line.
Transform each JSON object into a Bigtable mutation.
Use either a Dataflow BigtableIO [3] sink or a Bigtable client [4] to write the data to Bigtable (a sketch of these steps follows the references below).
[1] https://cloud.google.com/resource-manager/docs/cloud-asset-inventory/libraries
[2] https://cloud.google.com/storage/docs/reference/libraries
[3] https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.html
[4] https://cloud.google.com/bigtable/docs/reference/libraries
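A minimal Beam (Python) sketch of those steps, assuming a newline-delimited JSON export and hypothetical bucket, instance, table, and column-family names; the field-to-column mapping shown is entirely illustrative and up to you:

import json

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row as bigtable_row

def to_mutation(line):
    # Map one exported asset to one Bigtable row. The row key, the
    # "metadata" column family (which must already exist in the table)
    # and the field names used here are illustrative assumptions.
    asset = json.loads(line)
    direct_row = bigtable_row.DirectRow(row_key=asset["name"].encode("utf-8"))
    direct_row.set_cell("metadata", b"asset_type",
                        asset.get("assetType", "").encode("utf-8"))
    direct_row.set_cell("metadata", b"raw_json", line.encode("utf-8"))
    return direct_row

with beam.Pipeline() as pipeline:
    (pipeline
     | "ReadExport" >> beam.io.ReadFromText("gs://my-bucket/asset-export/*.json")
     | "ToMutation" >> beam.Map(to_mutation)
     | "WriteToBigtable" >> WriteToBigTable(project_id="my-project",
                                            instance_id="my-instance",
                                            table_id="assets"))

For small exports, the same to_mutation logic can be reused with the Cloud Storage and Bigtable clients directly, skipping Dataflow entirely.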

Related

How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfer into BigQuery or Cloud Storage, but I haven't found anything regarding scheduling an export from a BigQuery table to Cloud Storage yet.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export the BigQuery table to Cloud Storage. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP request, on a cron schedule, to the Cloud Function, which in turn would be triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.
Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to export different tables, datasets and buckets on each execution while reusing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing those parameters as data, which would be passed on to the Cloud Function. That would imply some small changes to its code, as sketched below.
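For example, a minimal sketch of reading such per-request parameters, assuming a JSON body with hypothetical project, dataset, table, and bucket keys (the rest of the function stays as in the code above):

def hello_world(request):
    # Fall back to defaults when the scheduler sends no body.
    params = request.get_json(silent=True) or {}
    project_name = params.get("project", "YOUR_PROJECT_ID")
    dataset_name = params.get("dataset", "YOUR_DATASET")
    table_name = params.get("table", "YOUR_TABLE")
    bucket_name = params.get("bucket", "YOUR_BUCKET")
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")
    # ...then build the job_config and call extract_table exactly as before...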
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
This could also be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
There is an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. Thereby, the Cloud Scheduler setup described by Maxim is optional and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status of the Pub/Sub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or if you want to generalize the function.
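As a rough sketch, a Pub/Sub-triggered Cloud Function along these lines could replace the HTTP-triggered one. The message body is a JSON-encoded transfer run; the state and errorStatus field names used here are assumptions you should verify against a real notification:

import base64
import json

from google.cloud import bigquery

def on_scheduled_query_done(event, context):
    # Decode the Pub/Sub message published by the BigQuery scheduled query.
    run = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Skip the export if the scheduled query did not succeed
    # (field names are assumptions; inspect an actual message first).
    if run.get("state") != "SUCCEEDED" or run.get("errorStatus"):
        print("Scheduled query failed, skipping export: {}".format(run.get("errorStatus")))
        return

    bq_client = bigquery.Client()
    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        "YOUR_PROJECT_ID.YOUR_DATASET.YOUR_TABLE",
        "gs://YOUR_BUCKET/bq_export.csv.gz",
        location="US",  # must match the source table's location
        job_config=job_config,
    )
    print("Started export job {}".format(extract_job.job_id))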
One more point, about the SFTP transfer. I open-sourced a project for querying BigQuery, building a CSV file and transferring that file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.

Is GCP Firestore Native Mode export to BQ import supported?

I was exploring options to load Firestore Native mode data (collections and documents) into BQ, but it's not working out for me.
Question: Does BigQuery support importing an extract from a Firestore Native mode export?
Setup: 1 collection with multiple documents (no sub collections).
Steps:
- Export to Cloud Bucket: https://firebase.google.com/docs/firestore/manage-data/export-import
- Import in BQ: https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore
Error While loading in BQ: 'Does not contain valid backup metadata'
Analysis: It's mentioned in the link that the URI should have KIND_COLLECTION_ID and that the file should end with [KIND_COLLECTION_ID].export_metadata. But neither of these is true for a Firestore Native mode export file; it's applicable to Firestore Datastore mode exports.
Verify [KIND_COLLECTION_ID] is specified in your Cloud Storage URI. If you specify the URI without [KIND_COLLECTION_ID], you receive the following error: does not contain valid backup metadata. (error code: invalid)
The URI for your Cloud Firestore export file should end with [KIND_COLLECTION_ID].export_metadata. For example: default_namespace_kind_Book.export_metadata. In this example, Book is the collection ID, and default_namespace_kind_Book is the file name generated by Cloud Firestore.
When one creates an export of Firestore collections to GCS, a directory structure is created that looks like:
[Bucket]
- [Date/Time]
- [Date/Time].overall_export_metadata
- all_namespaces
- kind_[collection]
- all_namespaces_kind_[collection].export_metadata
When one imports an export into BigQuery, use the file:
[Bucket]/[Date/Time]/all_namespaces/kind_[collection]/all_namespaces_kind_[collection].export_metadata
Specifically, if you use [Bucket]/[Date/Time]/[Date/Time].overall_export_metadata, you will get the error you described. See also the note here under Console > Bullet 3 which reads:
Note: Do not use the file ending in overall_export_metadata. This file
is not usable by BigQuery.
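For reference, a minimal sketch of that load using the BigQuery Python client, with hypothetical bucket, export folder, collection, and table names (Firestore exports are loaded with the DATASTORE_BACKUP source format):

from google.cloud import bigquery

client = bigquery.Client()

# Point at the collection-level export_metadata file,
# NOT the overall_export_metadata file.
uri = ("gs://my-bucket/2021-06-01T00:00:00_12345/all_namespaces/"
       "kind_books/all_namespaces_kind_books.export_metadata")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,  # also used for Firestore exports
)

load_job = client.load_table_from_uri(
    uri, "my-project.my_dataset.books", job_config=job_config)
load_job.result()  # wait for the load to finish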
If you want to create a pipeline from Firestore to BigQuery you should manually format the Firestore collection into a BigQuery table. I have used Cloud Scheduler, Cloud Functions and Firestore batched operations to migrate the data from Firestore to BigQuery. I created example code here

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog, as sketched below.
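A minimal sketch of that approach as a Glue PySpark job, using create_dynamic_frame.from_options (the Python counterpart of getSourceWithFormat); the bucket names and the Parquet output are assumptions:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read every sales.json under the bucket, recursing through the store/date prefixes.
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# ...apply any required transformations here (ApplyMapping, Map, ResolveChoice, ...)...

# Write the result out, e.g. to another (hypothetical) bucket as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/sales/"},
    format="parquet",
)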
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Read more about it here
Then you can use Relationalize to flatten out your JSON structure; read more about that here, and see the sketch below
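A hedged sketch of that flattening step, assuming the crawler registered a table named sales in a database named my_db and that an S3 staging path is available (all of these names are hypothetical):

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the crawler registered in the Glue Data Catalog.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="sales")

# Relationalize flattens nested JSON into a collection of flat frames.
flattened = Relationalize.apply(
    frame=sales,
    staging_path="s3://my-bucket/glue-staging/",
    name="root",
    transformation_ctx="relationalize",
)

# "root" holds the top-level records; nested arrays become additional frames.
root = flattened.select("root")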
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say if your data is already in JSON format, it's probably smarter to convert it to JSONL than to convert it to CSV.

Datastore - Unable to Write Entities on Datastore using Dataflow Job

I have a CSV file in a GCS bucket that I need to move to Google Cloud Datastore. The CSV shape is (60000, 6). Using Cloud Dataflow I wrote a pipeline and moved the data into Datastore. The Dataflow job completed successfully, but when I check the data in Datastore there are no entities. This is the pipeline image for your reference,
and the pool node graph is here.
From the pipeline graph, I came to see that it did no work creating entities or writing them into Datastore (0 secs).
To do this job, I referred to this tutorial: Uploading CSV File to The Datastore.
It would be very helpful to know where the pipeline went wrong.

How to batch load custom Avro data generated from another source?

The Cloud Spanner docs say that Spanner can export/import Avro format. Can this path also be used for batch ingestion of Avro data generated from another source? The docs seem to suggest it can only import Avro data that was also generated by Spanner.
I ran a quick export job and took a look at the generated files. The manifest and schema look pretty straight forward. I figured I would post here in case this rabbit hole is deep.
manifest file
{
  "files": [{
    "name": "people.avro-00000-of-00001",
    "md5": "HsMZeZFnKd06MVkmiG42Ag=="
  }]
}
schema file
{
  "tables": [{
    "name": "people",
    "manifestFile": "people-manifest.json"
  }]
}
data file
{"type":"record",
"name":"people",
"namespace":
"spannerexport","
fields":[
{"name":"fullName",
"type":["null","string"],
"sqlType":"STRING(MAX)"},{"name":"memberId",
"type":"long",
"sqlType":"INT64"}
],
"googleStorage":"CloudSpanner",
"spannerPrimaryKey":"`memberId` ASC",
"spannerParent":"",
"spannerPrimaryKey_0":"`memberId` ASC",
"googleFormatVersion":"1.0.0"}
In response to your question, yes! There are two ways to do ingestion of Avro data into Cloud Spanner.
Method 1
If you place Avro files in a Google Cloud Storage bucket arranged as a Cloud Spanner export operation would arrange them and you generate a manifest formatted as Cloud Spanner expects, then using the import functionality in the web interface for Cloud Spanner will work. Obviously, there may be a lot of tedious formatting work here which is why the official documentation states that this "import process supports only Avro files exported from Cloud Spanner".
Method 2
Instead of executing the import/export job using the Cloud Spanner web console and relying on the Avro manifest and data files being perfectly formatted, slightly modify the code in either of two public GitHub repositories under the Google Cloud Platform user that provide import/export (or backup/restore, or export/ingest) functionality for moving data in Avro format into Google Cloud Spanner: (1) Dataflow Templates, especially this file; (2) Pontem, especially this file.
Both of these have Dataflow jobs written that allow you to move data into and out of Cloud Spanner using the Avro format. Each has a specific means of parsing an Avro schema for input (i.e., moving data from Avro into Cloud Spanner). Since your use case is input (i.e., ingesting Avro-formatted data into Cloud Spanner), you need to modify the Avro parsing code to fit your specific schema and then launch the Cloud Dataflow job from the command line on your machine (the job is then uploaded to Google Cloud Platform).
If you are not familiar with Cloud Dataflow, it is a tool for defining and running jobs with large data sets.
As the documentation specifically states that importing only supports Avro files initially exported from Spanner [1], I've raised a feature request for this which you can track here.
[1] https://cloud.google.com/spanner/docs/import