GCS to BigQuery table using stored Proc - google-cloud-platform

I want to create a stored procedure that can read data from a GCS bucket and store it into a table in BigQuery. I was able to do it using Python by connecting to GCS and creating a BigQuery client:
from google.oauth2 import service_account
from google.cloud import bigquery
credentials = service_account.Credentials.from_service_account_file(path_to_key)
bq_client = bigquery.Client(credentials=credentials, project=project_id)
Can we achieve the same using a stored procedure?

Have you looked into using an external table? You can query directly off of Google Cloud Storage without needing to load anything. Just define the schema of the expected data along with the GCS URIs, and once the data is in GCS it is accessible via SQL in BigQuery.

Otherwise, no. There is no LOAD statement that you can execute via BigQuery SQL. See the docs here for all of the ways to load data into a table.

You could have the external table and create a stored procedure that does an INSERT into another table using the data from the external table you created, as in the sketch below. That is, if you are really hell-bent on having a stored procedure to "load" data into a normal BigQuery table. Otherwise, external tables are an excellent option that obviates the need to even load the data in the first place.
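A minimal sketch of that external-table-plus-procedure pattern, driven from the same Python client the question already uses. The project, dataset, table, and bucket names below are placeholders, and the CSV schema is only illustrative:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# External table that reads directly from the files in GCS.
client.query("""
CREATE OR REPLACE EXTERNAL TABLE `my_dataset.gcs_staging` (
  id INT64,
  name STRING
)
OPTIONS (format = 'CSV', uris = ['gs://my-bucket/exports/*.csv'])
""").result()

# Stored procedure that "loads" the staged GCS data into a native table.
client.query("""
CREATE OR REPLACE PROCEDURE `my_dataset.load_from_gcs`()
BEGIN
  INSERT INTO `my_dataset.target_table`
  SELECT id, name FROM `my_dataset.gcs_staging`;
END
""").result()

# The procedure can then be invoked from SQL (or on a schedule).
client.query("CALL `my_dataset.load_from_gcs`()").result()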

Related

Does GCP Data Loss Prevention support publishing its results to Data Catalog for External Big Query Tables

I was trying to auto-tag infoTypes like PhoneNumber and EmailId on data in a GCS bucket and in BigQuery external tables using the Data Loss Prevention tool in GCP, so that I can have those tags in Data Catalog and subsequently in Dataplex. The problems are:
If I select any source other than a BigQuery table (GCS, Datastore, etc.), the option to publish GCP DLP inspection results to Data Catalog is disabled.
If I select a BigQuery table, the Data Catalog publish option is enabled, but when I try to run the inspection job it errors out with "External tables are not supported for inspection". Surprisingly, it supports only internal BigQuery tables.
So the question is: is my understanding correct that the GCP DLP - Data Catalog integration works only for internal BigQuery tables? Am I doing something wrong here? The GCP documentation doesn't mention these things either.
Also, while configuring the inspection job from the DLP UI console, I had to provide a BigQuery table ID mandatorily. Is there a way I can run a DLP inspection job against a BQ dataset or a bunch of tables?
Regarding Data Loss Prevention services in Google Cloud, your understanding is correct: data cannot be exfiltrated by copying to services outside the perimeter, e.g., a public Google Cloud Storage (GCS) bucket or an external BigQuery table. Visit this URL for more reference.
Now, about how to run a DLP inspection job against a bunch of BQ tables, there are two ways to do it:
Programmatically fetch the BigQuery tables, query each table, and call the DLP Streaming Content API. It operates in real time, but it is expensive. Here I share the concept in a Java example:
// Build the BigQuery JDBC connection URL (Simba driver) and list every table.
String url = String.format(
    "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
    projectId);

DataSource ds = new com.simba.googlebigquery.jdbc42.DataSource();
ds.setURL(url);
Connection conn = ds.getConnection();

DatabaseMetaData databaseMetadata = conn.getMetaData();
ResultSet tablesResultSet =
    databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
while (tablesResultSet.next()) {
  // Query your table data and call the DLP Streaming Content API.
}
Here is a tutorial for this method.
Programmatically fetch the BigQuery tables and then trigger one inspect job for each table. It is the cheapest method, but you need to consider that it is a batch operation, so it doesn't execute in real time. Here is the concept in a Python example:
from google.cloud import bigquery

client = bigquery.Client()
datasets = list(client.list_datasets(project=project_id))
if datasets:
    for dataset in datasets:
        tables = client.list_tables(dataset.dataset_id)
        for table in tables:
            pass  # Create an inspect job for table.table_id (see the sketch below)
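For the per-table inspect job itself, here is a hedged sketch using the google-cloud-dlp client, callable from the loop above where the placeholder comment sits. The infoTypes are only examples, and the Data Catalog action shown works only for native BigQuery tables, per the limitation discussed in the question:

from google.cloud import dlp_v2

def create_inspect_job(project_id: str, dataset_id: str, table_id: str) -> None:
    """Create one DLP inspect job for a single BigQuery table."""
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    inspect_job = {
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": project_id,
                    "dataset_id": dataset_id,
                    "table_id": table_id,
                }
            }
        },
        # Example infoTypes; adjust to what you want to tag.
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        },
        # Publish findings to Data Catalog; only supported for native BQ tables.
        "actions": [{"publish_findings_to_cloud_data_catalog": {}}],
    }
    dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})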
Use this thread for more reference on running a DLP inspection job against a bunch of BQ tables.

GCP moving Bigtable data to BigQuery

In GCP I would like to know if it is possible to transfer/move data from Bigtable to BigQuery, i.e., let's say I want all data older than 1 year to be moved from Bigtable to BigQuery. Is this doable?
Can someone please help me with this?
You can query the Bigtable data from BigQuery thanks to the external table configuration.
Because you are able to query the data from BigQuery, you can perform an INSERT ... SELECT into a BigQuery table, as in the sketch below.
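A minimal sketch of that INSERT ... SELECT, assuming the Bigtable external table has already been defined and exposes a timestamp column; every name here (dataset, tables, event_timestamp) is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()

# `my_dataset.bigtable_external` is assumed to be an external table over the
# Bigtable instance; `my_dataset.archive` is a native BigQuery table.
sql = """
INSERT INTO `my_dataset.archive`
SELECT *
FROM `my_dataset.bigtable_external`
WHERE event_timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
"""
client.query(sql).result()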
EDIT 1
You can't do it automatically. You must write custom code to copy only the old data and then delete it.
You must have a timestamp field for that. To copy the data from Bigtable to BigQuery, you can use an external table, but you can't delete the data through the BigQuery external connection.
To purge the Bigtable data, you can use the garbage collection feature.

Update BigQuery permanent external tables

I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some Parquet files are written to GCS, and with a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in the Parquet files. If I create a permanent external table and then update the underlying files, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table when you add new files to the Cloud Storage bucket. The only exception is if the number of columns is different in a new file; then the external table will not work as expected.
You need to use a wildcard to read files that match a specific pattern rather than providing a static file name, for example "gs://bucketName/*.csv". A sketch for the Parquet case follows.
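As a hedged sketch of that setup for the Parquet scenario in the question (bucket, dataset, table, and column names are placeholders): the external table is created once with a wildcard URI, and every query against it, including the JOIN, sees whatever files currently match the pattern.

from google.cloud import bigquery

client = bigquery.Client()

# Permanent external table over the daily Parquet files, created once.
client.query("""
CREATE EXTERNAL TABLE IF NOT EXISTS `my_dataset.daily_parquet`
OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/daily/*.parquet'])
""").result()

# The JOIN always reflects the files currently in the bucket.
rows = client.query("""
SELECT n.id, n.label, p.value
FROM `my_dataset.native_table` AS n
JOIN `my_dataset.daily_parquet` AS p USING (id)
""").result()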

How data retrieved from metadata created tables in Glue Script

In AWS Glue, although I read the documentation, there is one thing I didn't get clear. Below is what I understood.
Regarding crawlers: a crawler creates a metadata table for either S3 or a DynamoDB table. But what I don't understand is: how is the Scala/Python script able to retrieve data from the actual source (say DynamoDB or S3) using the metadata tables it created?
val input = glueContext
  .getCatalogSource(database = "my_data_base", tableName = "my_table")
  .getDynamicFrame()
Does the above line retrieve data from the actual source via the metadata tables?
I would be glad if someone could explain to me what happens behind the scenes when retrieving data in a Glue script via metadata tables.
When you run a Glue crawler it will fetch metadata from S3 or JDBC (depending on your requirement) and create tables in the AWS Glue Data Catalog.
Now if you want to connect to this data/these tables from a Glue ETL job, you can do it in multiple ways depending on your requirement:
[from_options][1]: if you want to load directly from S3/JDBC without connecting to the Glue catalog.
[from_catalog][1]: if you want to load data from the Glue catalog, you need to link it with the catalog using the getCatalogSource method as shown in your code. As the name implies, it will use the Glue Data Catalog as the source and load the particular table that you pass to this method.
Once it looks at your table definition, which points to a location, it will make a connection and load the data present in the source. A PySpark sketch of both approaches follows.
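As a hedged illustration of the two read paths (database, table, and bucket names are placeholders; this assumes the standard awsglue libraries available inside a Glue job):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# from_catalog: schema and S3 location are resolved through the Glue Data Catalog.
catalog_frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_data_base", table_name="my_table"
)

# from_options: read straight from S3, bypassing the catalog entirely.
s3_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},
    format="parquet",
)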
Yes, you need to use getCatalogSource if you want to load tables from the Glue catalog.
Does the Catalog look into the crawler, refer to the actual source, and load the data?
Check out the diagram in this [link][2]. It will give you an idea about the flow.
What if the crawler is deleted before I run getCatalogSource? Will I still be able to load the data in that case?
The crawler and the table are two different components. It all depends on when the table is deleted. If you delete the table after your job starts to execute, there will not be any problem. If you delete it before execution starts, you will encounter an error.
What if my source has many millions of records? Will this load all records, and how does that work?
It is good to have large files present in the source, as this avoids most of the small-files problem. Glue is based on Spark, and it will read files that can fit in memory and then do the computations. Check this [answer][3] and [this][4] for best practices when reading larger files in AWS Glue.
[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets

Data Transformation in AWS EMR without using Scala or Python

I have a star schema kind of database structure: one fact table holding all the IDs & skeys, and multiple dimension tables holding the actual ID, code, and descriptions for the IDs referred to in the fact table.
We are moving all these tables (fact & dimensions) to S3 (cloud) individually, and each table's data is split into multiple Parquet files in an S3 location (one S3 object per table).
Query: I need to perform a transformation in the cloud, i.e., strip out all the IDs & skeys referred to in the fact table, replace them with the actual codes residing in the dimension tables, create another file, and store the final output back in an S3 location. This file will later be consumed by Redshift for analytics.
My doubts:
What's the best way to achieve this, since I don't need the raw data (skeys & IDs) in Redshift, for cost and storage optimization?
Do we need to first combine these split files (Parquet) into one large file before performing the data transformation? Also, after the data transformation I am planning to save the final output in Parquet format, but the catch is that Redshift doesn't allow COPY of a Parquet file, so is there a workaround for that?
I am not a hardcore programmer and want to avoid using Scala/Python in EMR, but I am good at SQL. So is there a way to perform the data transformation in the cloud through SQL on EMR and save the output into a file or files? Please advise.
You should be able to run Redshift-type queries directly against your S3 Parquet data by using Amazon Athena.
Some information on that:
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
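As a hedged illustration of the SQL-only route (all database, table, column, and bucket names are placeholders): an Athena CREATE TABLE AS SELECT can join the fact table to a dimension table and write the denormalized result back to S3 as Parquet, which matches the transformation described in the question.

import boto3

athena = boto3.client("athena")

# CTAS: join fact to dimension and write Parquet output back to S3.
query = """
CREATE TABLE analytics.fact_denormalized
WITH (format = 'PARQUET',
      external_location = 's3://my-bucket/output/fact_denormalized/') AS
SELECT f.order_id, d.product_code, d.product_description, f.amount
FROM analytics.fact_sales f
JOIN analytics.dim_product d ON f.product_skey = d.product_skey
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)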