GCP Data Catalog Schema History or Versioning - google-cloud-platform

I've been wondering whether it is possible to keep versions of schemas in the GCP Data Catalog service. Alternatively, how do you deal with Data Catalog entries when a schema changes (e.g. in Cloud SQL, a GCS fileset, or BigQuery), and how could history be handled if it is not supported by Google?
I tried investigating Data Catalog API calls and Cloud Logging after an entry is updated; however, there were no changes and no history.
I've found that functionality in AWS (https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html).
There is also an unanswered question in the GCP Community: https://www.googlecloudcommunity.com/gc/Data-Analytics/How-can-I-see-entry-history/m-p/425135#M338
Custom tools such as Liquibase (https://medium.com/google-cloud/version-control-of-bigquery-schema-changes-with-liquibase-ddc7092d6d1d) are not suitable in this case, as they are limited to BigQuery (not all GCP services).
I'm looking for any kind of versioning of Data Catalog entries (schemas in particular), history in logs, or similar.

Unfortunately, there is currently no versioning in Data Catalog. If an entry's schema changes in the ingested system, the tag attached to a removed column is lost. For simple versioning use cases you may consider using Terraform and storing the configuration in Cloud Source Repositories.
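Since the catalog itself keeps no history, another option is to snapshot the schemas yourself and commit the snapshots to a repository. Below is a minimal sketch of that idea, assuming the google-cloud-datacatalog v3 (proto-plus) Python client; the project, dataset and table names are placeholders. It looks up the entry for a BigQuery table, serializes its schema to JSON, and writes a timestamped file that a scheduled job could commit to Cloud Source Repositories (or any git repo). Diffing consecutive snapshots then gives you the schema history that Data Catalog does not provide natively.
```python
# Minimal sketch of a DIY schema-history job, assuming the
# google-cloud-datacatalog v3 (proto-plus) Python client.
# Project/dataset/table names below are placeholders.
import datetime
import pathlib

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Hypothetical BigQuery table; any resource indexed by Data Catalog works.
linked_resource = (
    "//bigquery.googleapis.com/projects/my-project"
    "/datasets/my_dataset/tables/my_table"
)

# Look up the Data Catalog entry for the resource.
entry = client.lookup_entry(request={"linked_resource": linked_resource})

# Serialize the current schema; proto-plus messages expose to_json().
schema_json = datacatalog_v1.types.Schema.to_json(entry.schema)

# Write a timestamped snapshot that can be committed to a git repo
# (e.g. Cloud Source Repositories) by a scheduled job.
snapshot_dir = pathlib.Path("schema_history") / entry.name.replace("/", "_")
snapshot_dir.mkdir(parents=True, exist_ok=True)
stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
(snapshot_dir / f"{stamp}.json").write_text(schema_json)
print(f"Saved schema snapshot for {entry.name}")
```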

Related

Is it possible to integrate Cloud SQL or Cloud Spanner with Google Data Catalog

I am able to discover BigQuery datasets and GCS files in Google Data Catalog, but I could not find Cloud SQL or Cloud Spanner options in the Cloud Data Catalog UI.
Is it possible to view Cloud SQL tables and Cloud Spanner table data in Data Catalog? If yes, please suggest steps or provide documentation links.
Thanks.
Yes, it is possible using Data Catalog custom entries.
To view Cloud SQL tables, you can use the open source connectors for MySQL, SQL Server and PostgreSQL.
Also check the on-premises ingestion use cases in the official docs.
Yes, it is possible.
Details: metadata types other than the natively integrated ones (GCS, Pub/Sub and BigQuery) need to be handled via the Data Catalog APIs.
Ref: https://cloud.google.com/data-catalog/docs/how-to/custom-entries
That is, use one of the seven supported client languages to programmatically loop through all the tables of the custom data source (e.g. Bigtable) and create entries and tags dynamically; see the sketch below. My favorites are Python and C#.
Much appreciated if anyone has a better alternative approach.
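To make the loop-and-create idea concrete, here is a rough sketch using the google-cloud-datacatalog v3 Python client; the project, location, entry group ID and hard-coded table list are placeholders, and a real connector would read the tables and columns from Cloud SQL, Spanner, Bigtable, etc.
```python
# Rough sketch: create custom Data Catalog entries for tables from a
# non-native source, assuming the google-cloud-datacatalog v3 Python client.
# Project, location, entry group ID and the table list are placeholders.
from google.cloud import datacatalog_v1

PROJECT_ID = "my-project"
LOCATION = "us-central1"

client = datacatalog_v1.DataCatalogClient()

# Create an entry group to hold the custom entries (fails if it already exists).
entry_group = client.create_entry_group(
    parent=f"projects/{PROJECT_ID}/locations/{LOCATION}",
    entry_group_id="cloudsql_tables",
    entry_group=datacatalog_v1.EntryGroup(display_name="Cloud SQL tables"),
)

# Loop over the tables discovered in the source system (hard-coded here).
tables = {"customers": [("id", "INT64"), ("email", "STRING")]}

for table_name, columns in tables.items():
    entry = datacatalog_v1.Entry()
    entry.display_name = table_name
    entry.user_specified_system = "cloud_sql"  # free-form system label
    entry.user_specified_type = "table"        # free-form type label
    for column_name, column_type in columns:
        entry.schema.columns.append(
            datacatalog_v1.ColumnSchema(
                column=column_name, type_=column_type, mode="REQUIRED"
            )
        )
    created = client.create_entry(
        parent=entry_group.name, entry_id=table_name, entry=entry
    )
    print(f"Created entry: {created.name}")
```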
Unfortunately, there is no native integration between Data Catalog and Cloud SQL or Cloud Spanner. Nevertheless, there is an issue tracker entry requesting this feature.
As you can see in the shared link, as a workaround you can manually create a JDBC connector to Spanner and export the metadata to Data Catalog custom entries on a schedule, along the lines of Mahendren's suggestion. You can do something similar with Cloud SQL.

What is the drawback of the AWS Glue Data Catalog?

What is the major drawback of the AWS Glue Data Catalog? I was asked this in an interview.
That could be answered in a number of ways depending on the wider context. For example:
- It's an AWS managed service, so using it locks you into the AWS ecosystem (instead of using a standalone Hive metastore, for example).
- It's limited to the data sources supported by the Glue Data Catalog.
- It doesn't integrate with third-party authentication and authorisation tools like Apache Ranger (as far as I am aware).

How to create a data catalog for assets in Google Cloud Storage (objects, buckets, ...) using the Google Data Catalog service

Google Data Catalog is currently in beta and provides catalog support for BigQuery and Cloud Pub/Sub, but not for Google Cloud Storage.
Is there any way, using existing components/services, to build a data catalog for assets stored in Google Cloud Storage (buckets, objects, ...), and when could we expect direct support for GCS in Google Data Catalog?
According to Google's documentation:
Tagging Cloud Storage assets (for example, buckets and objects) is unavailable in the Data Catalog beta release.
Full support for Cloud Data Catalog is scheduled for the last quarter of this year, 2019 (from October on).
Data Catalog became GA: Data Catalog GA
And they have updated the docs for Filesets:
Data Catalog Filesets
Finally, if you want to create Data Catalog entries for each of your Cloud Storage objects, you may use the open source datacatalog-util script, which has an option to create entries for your files.
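For completeness, here is a small sketch of creating a GCS fileset entry with the google-cloud-datacatalog v3 Python client, roughly following the filesets documentation linked above; the project, location, bucket and IDs are placeholders.
```python
# Small sketch: create a GCS fileset entry with the google-cloud-datacatalog
# v3 Python client. Project, location, bucket and IDs are placeholders.
from google.cloud import datacatalog_v1

PROJECT_ID = "my-project"
LOCATION = "us-central1"

client = datacatalog_v1.DataCatalogClient()

entry_group = client.create_entry_group(
    parent=f"projects/{PROJECT_ID}/locations/{LOCATION}",
    entry_group_id="my_fileset_group",
    entry_group=datacatalog_v1.EntryGroup(display_name="GCS filesets"),
)

entry = datacatalog_v1.Entry()
entry.display_name = "Landing CSV exports"
entry.description = "CSV files dropped into the landing bucket."
entry.type_ = datacatalog_v1.EntryType.FILESET
# One or more gs:// patterns; all matching objects belong to this fileset.
entry.gcs_fileset_spec.file_patterns.append("gs://my-landing-bucket/exports/*.csv")

created = client.create_entry(
    parent=entry_group.name, entry_id="landing_csv_exports", entry=entry
)
print(f"Created fileset entry: {created.name}")
```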

AWS: best approach to process data from S3 to RDS

I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30 GB) from S3, apply some transformations, and load it into RDS MySQL, and I want this process to be replicable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found Coursera's dataduct wrapper, but after some research it seems that the project has been abandoned (the last commit was one year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they are simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused; can anyone enlighten me?
Thanks in advance.
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving the data from S3 to RDS in order to run SQL-like queries.
You can now use Redshift Spectrum to read and query information from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1: Create an IAM role for Amazon Redshift
Step 2: Associate the IAM role with your cluster
Step 3: Create an external schema and an external table
Step 4: Query your data in Amazon S3
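For reference, here is a rough sketch of steps 3 and 4 driven from Python with psycopg2 against the cluster endpoint; the cluster host, credentials, IAM role ARN, schema, table and bucket names are all placeholders.
```python
# Rough sketch of steps 3 and 4, executed from Python via psycopg2.
# Cluster endpoint, credentials, IAM role ARN and S3 paths are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)
# External schema/table DDL cannot run inside a transaction block.
conn.autocommit = True

with conn.cursor() as cur:
    # Step 3a: external schema backed by the external data catalog.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG
        DATABASE 'spectrumdb'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)
    # Step 3b: external table pointing at the CSV files in S3.
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.sales (
            sale_id BIGINT,
            amount  DECIMAL(10,2),
            sale_ts TIMESTAMP
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 's3://my-bucket/sales/';
    """)
    # Step 4: query the S3 data in place.
    cur.execute("SELECT COUNT(*) FROM spectrum.sales;")
    print(cur.fetchone())

conn.close()
```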
Or you can use Athena to query the data in S3 as well, if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
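Assuming the table has already been defined over the S3 data (for example via a CREATE EXTERNAL TABLE in Athena or a Glue crawler), a query can be kicked off from Python with boto3; the region, database, table and output location below are placeholders.
```python
# Short sketch of the Athena route with boto3. The database, table and
# result output location are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM sales",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```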
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages) and has JDBC/ODBC compliant drivers. With it you could create a script that performs your transformations and then loads the data into your MySQL database. And you would be using familiar SQL (I'm assuming you can already write SQL scripts), so there isn't a big learning curve.
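Not Scriptella specifically, but to illustrate the same ETL shape in plain Python: stream the CSV from S3 in chunks, transform each chunk, and append it to the MySQL table. The bucket, key, column and connection details are placeholders, and for a 30 GB file you would run this close to the RDS instance.
```python
# Plain-Python sketch of the ETL flow: stream the CSV from S3 in chunks,
# transform each chunk, and append it to the RDS MySQL table. Bucket, key,
# columns and connection details are placeholders.
import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="exports/big.csv")

engine = create_engine("mysql+pymysql://user:password@my-rds-host:3306/mydb")

# Read the S3 body as a stream in chunks so the 30 GB file never sits
# entirely in memory.
for chunk in pd.read_csv(obj["Body"], chunksize=100_000):
    chunk["amount"] = chunk["amount"].astype(float)  # example transformation
    chunk.to_sql("sales", engine, if_exists="append", index=False)
```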

Is it possible to see a history of interaction with tables in a Redshift schema?

Ultimately, I would like to obtain a list of tables in a particular schema that haven't been queried in the last two weeks (say).
I know that there are many system tables that track various things about how the Redshift cluster is functioning, but I have yet to find one that I could use to obtain the above.
Is what I want to do possible?
Please have a look at our "Unscanned Tables" query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/unscanned_table_summary.sql
If you have enabled audit logging for the cluster, activity data is stored in the S3 bucket you configured when enabling logging.
According to the AWS documentation, the audit log bucket structure is as follows:
AWSLogs/AccountID/ServiceName/Region/Year/Month/Day/AccountID_ServiceName_Region_ClusterName_LogType_Timestamp.gz
For example: AWSLogs/123456789012/redshift/us-east-1/2013/10/29/123456789012_redshift_us-east-1_mycluster_userlog_2013-10-29T18:01.gz
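Following that key layout, a small boto3 sketch can list the relevant log files for a given day (the bucket name, account ID, region and date are placeholders); the user activity and user logs are the ones that record which tables were queried.
```python
# Minimal sketch that lists Redshift audit-log objects for one day, following
# the key layout documented above. Bucket, account ID, region, date and
# cluster name are placeholders.
import boto3

s3 = boto3.client("s3")
prefix = "AWSLogs/123456789012/redshift/us-east-1/2013/10/29/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-audit-log-bucket", Prefix=prefix):
    for obj in page.get("Contents", []):
        # The user activity / user logs record which tables were queried.
        if "_useractivitylog_" in obj["Key"] or "_userlog_" in obj["Key"]:
            print(obj["Key"], obj["Size"])
```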