Will IGC allow me to trace where data has been sourced from or how data is being consumed, for any ETL or Data Transformation Tool? - ibm-infosphere

As part of our Governance initiative and regulatory requirements, we need to produce a Lineage (traceability) report outlining the flow of data into our Warehouse, and the Reports or Services consuming its data. We are aware that Information Governance Catalog can produce such a report automatically when DataStage is writing data to the Warehouse. Can Information Governance Catalog do the same when we use SQL scripts or other tooling to read or write information to our Warehouse? Can I view a complete Lineage report that incorporates such different information?
What are the steps within IGC to document or otherwise define the usage of information to support Data Lineage and Regulatory reporting?

Yes. While IGC automates the production of Lineage (traceability) reports for DataStage, it also offers a facility to document the flow of data for other data movement scripts, tools, or processes. This produces the same Lineage reports, which can be used to satisfy compliance needs, or to build confidence and trust in the use or consumption of data.
At its simplest, IGC allows one to draft a Mapping Document: essentially a spreadsheet that delineates the Data Source and Data Target, together with documentation of the transformation, aggregation, or other logic applied. The spreadsheet can be authored directly in IGC, or loaded from Excel (text file), which further supports automation of the process. Documentation for Extension Mapping Documents can be found here: https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.5.0/com.ibm.swg.im.iis.mdwb.doc/topics/c_extensionMappings.html (though we suggest creating such a document from IGC and exporting the results to Excel).
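For illustration, the rows of such a mapping document might look like the following (these column headings and asset names are purely hypothetical; use the template from the documentation above for the actual import format):

Source Asset,Target Asset,Rule,Description
STAGING.ORDERS.ORDER_AMT,DW.FACT_SALES.SALES_AMT,SUM per ORDER_ID,Nightly aggregation in agg_sales.sql
STAGING.CUST.CUST_NM,DW.DIM_CUSTOMER.CUSTOMER_NAME,TRIM and UPPER,Cleansing step in load_customers.sql

Each row documents one source-to-target movement, so the Lineage report can stitch these hops together with the lineage captured automatically from DataStage.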
In addition, IGC supports a more formal process for extending the Catalog and introducing new types of Assets. This goes one step further, properly documenting and cataloging the Data Processes (SQL commands, other ETL tooling) and mapping the data movement through those Processes. This allows users to identify the Data Process involved, and even to include operational data (as is supported for IGC). More information on this process can be found here: https://www-01.ibm.com/support/docview.wss?uid=swg21699130
We suggest reviewing the absolute requirements, and what information is required for the ensuing traceability report. Starting with the Extension Mapping Document should suffice, and would be the simplest to implement while driving immediate benefit.

Related

Data Quality Framework in AWS

I am trying to implement a data quality framework for an application which ingests data from various systems (batch, near real time, real time). A few items that I want to highlight here are:
The data pipelines vary widely and ingest very high volumes of data. They are developed using Spark, Python, EMR clusters, Kafka, and Kinesis streams.
Any new system that we onboard in the framework should be able to include the data quality checks with minimal coding, so some sort of metadata framework might help; for example, storing the business rules in DynamoDB so they can automatically run checks against different feeders or any new data pipeline created (see the sketch after this question).
Our tech stack includes AWS, Python, Spark, and Java, so kindly advise related services (AWS DataBrew, PyDeequ, the Great Expectations library, and various Lambda event-driven services are some I want to focus on).
I am also looking for some sort of audit, balance, and control mechanism: auditing the source data, balancing the number of records between two points, and having some automated mechanism to remediate (control) discrepancies.
I am looking for testing frameworks for the different data pipelines.
Also for data profiling, kindly advise tools/libraries; AWS DataBrew and Pandas are some I am exploring.
I know there won't be one specific solution, and hence I appreciate any and all ideas. A flow diagram of audit, balance, and control with an automated data validation and testing mechanism for data pipelines would be very helpful.
Thanks!!!
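For illustration, here is a minimal PyDeequ-style check of the kind such a metadata-driven framework might run per feed (the feed location, column, and rule choices are hypothetical):

from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Deequ runs on Spark, so the session needs the Deequ jars on its classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/feeds/customer/")  # hypothetical feed

check = Check(spark, CheckLevel.Error, "customer feed quality")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .hasSize(lambda n: n > 0)     # at least one record arrived
                    .isComplete("customer_id")    # key column has no nulls
                    .isUnique("customer_id"))     # key column has no duplicates
          .run())

# Inspect or persist the outcome; failed constraints can feed the "control" step.
VerificationResult.checkResultsAsDataFrame(spark, result).show()

In a metadata-driven setup, the constraint list would be read from a rules store (e.g. DynamoDB) rather than hard-coded, so new feeds only need a new rules entry.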

Best way to ingest data to bigquery

I have heterogeneous sources like flat files residing on prem, JSON on SharePoint, APIs which serve data, and so on. Which is the best ETL tool to bring data into the BigQuery environment?
I'm a kindergarten student in GCP :)
Thanks in advance
There are many solutions to achieve this. It depends on several factors, some of which are:
frequency of data ingestion
whether or not the data needs to be manipulated before being written into BigQuery (your files may not be formatted correctly)
whether this is going to be done manually or automated
size of the data being written
If you are just looking for an ETL tool, you can find many. If you plan to scale this to many pipelines you might want to look at a more advanced tool like Airflow, but if you just have a few one-off processes you could set up a Cloud Function within GCP to accomplish this. You can schedule it (via cron), invoke it through an HTTP endpoint, or trigger it via Pub/Sub. You can see an example of how this is done here
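As a sketch of that approach, a GCS-triggered background Cloud Function might look like this (the dataset, table, and trigger bucket are hypothetical; it assumes the google-cloud-bigquery client library):

from google.cloud import bigquery

def load_to_bigquery(event, context):
    """Triggered when a file lands in the configured GCS bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # or NEWLINE_DELIMITED_JSON
        autodetect=True,                          # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Start the load job and block until it finishes so errors surface in logs.
    client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config).result()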
After several attempts at data lake/data warehouse design and architecture, I can recommend only one thing: ingest your data into BigQuery as soon as possible, no matter the format or transformation.
Then, in BigQuery, run queries to format, clean, aggregate, and add value to your data. It's not ETL, it's ELT: you start by loading your data, and then you transform it.
It's quicker, cheaper, simpler, and only based on SQL.
It works only if you use ONLY BigQuery as the destination.
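To make the pattern concrete, here is a minimal sketch using the Python client (the dataset and table names, and the SQL itself, are purely illustrative):

from google.cloud import bigquery

client = bigquery.Client()
# Step 1 (Load) already happened: raw files were loaded as-is into raw_dataset.events_raw.
# Step 2 (Transform): reshape the raw data with plain SQL inside BigQuery.
client.query("""
    CREATE OR REPLACE TABLE curated_dataset.events AS
    SELECT
      CAST(event_ts AS TIMESTAMP)   AS event_ts,
      LOWER(TRIM(user_id))          AS user_id,
      SAFE_CAST(amount AS NUMERIC)  AS amount
    FROM raw_dataset.events_raw
    WHERE user_id IS NOT NULL
""").result()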
If you are starting from scratch and have no legacy tools to carry with you, the following GCP managed products target your use case:
Cloud Data Fusion, "a fully managed, code-free data integration service that helps users efficiently build and manage ETL/ELT data pipelines"
Cloud Composer, "a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines"
Dataflow, "a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing"
(Without considering a myriad of data integration tools and fully customized solutions using Cloud Run, Scheduler, Workflows, VMs, etc.)
Choosing one depends on your technical skills, real-time processing needs, and budget. As mentioned by Guillaume Blaquiere, if BigQuery is your only destination, you should try to leverage BigQuery's processing power on your data transformation.

Updating all entities of KIND in Google Cloud Datastore

We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product's functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart/quick way to do it, that does not involve iterating over all of the entities in series?
You can probably use Dataflow to help with your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
It was open sourced as the Apache Beam project and is fully compatible with the Beam SDK, which allows you to test your developments locally before running them on GCP.
It exposes two main concepts: a PCollection, basically the data being handled by the tool, and a pipeline, the sequence of steps needed to capture the data, the transformations that must be performed, and how and where the results should be written.
It provides support for Java, Python and Go, and a rich feature set and variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow provides support for reading, writing, and deleting data. See, for instance, the relevant documentation for Python.
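As a minimal sketch of a bulk update with Beam's Python Datastore connector (the project ID, Kind, and property name are hypothetical, and the runner/project pipeline options are omitted):

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore, WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

def add_field(entity):
    # Entities arrive as Beam Datastore entities; mutate the property dict in place.
    entity.properties['schema_version'] = 2
    return entity

with beam.Pipeline() as p:  # pass PipelineOptions to run on Dataflow
    (p
     | 'Read'   >> ReadFromDatastore(Query(kind='MyKind', project='my-project'))
     | 'Update' >> beam.Map(add_field)
     | 'Write'  >> WriteToDatastore('my-project'))

Because Dataflow shards the reads and writes across workers, the 10 million entities are processed in parallel rather than iterated in series.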
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
These two other articles could also be interesting: 1 2.
I would presume that you have to loop through each one and update it, as it's a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain; we had to write migrations that would loop through everything and update it.

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're describing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page)
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
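As an illustration of that simpler route, here is a sketch using the BigQuery Python client to materialize a weekly slice and export it to GCS, from where an on-premise loader can pick it up (all project, dataset, table, and bucket names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Materialize the weekly slice (the SQL is illustrative).
client.query("""
    CREATE OR REPLACE TABLE staging.weekly_extract AS
    SELECT * FROM warehouse.facts
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
""").result()

# Export it as compressed CSV shards for the on-premise process to ingest.
job_config = bigquery.ExtractJobConfig(compression="GZIP")
client.extract_table(
    "my-project.staging.weekly_extract",
    "gs://my-export-bucket/weekly/extract-*.csv.gz",
    job_config=job_config,
).result()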

How are you coping with BigQuery, especially when you came from a traditional RDBMS background like Oracle/MySQL?

I am new to BQ. I have a table with around 200 columns, and when I wanted to get the DDL of this table there was no ready-made option available. CTAS is not always desirable: sometimes we don't have a reference table to create with CTAS, and sometimes we just want a simple DDL statement to recreate a table.
I wanted to edit the schema of a BigQuery table with a change of mode: the previous mode is NULLABLE and it should now be REQUIRED (the column has been loaded with only non-null values so far).
Looking at all these scenarios, the lengthy solutions provided in the Google documentation, and the absence of direct solutions in terms of SQL statements (rather, some API calls/UI/scripts, etc.), I am not impressed with BigQuery and its many limitations. The Google BigQuery web UI is also so small that you need to scroll many times to see a query as a whole, along with many other web UI issues, as you know.
Just wanted to know how you are all handling/coping with BQ.
I would like to elaborate a little bit more on the comments from @Pentium10 and @guillaume blaquiere.
BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine, which is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure.
BigQuery is based on Google's column-based data processing technology called Dremel and is able to run queries against up to 20 different data sources and 200 GB of data concurrently. The Prediction API allows users to create and train a model hosted within Google's system. The API recognizes historical patterns to make predictions about patterns in new data.
BigQuery is unlike anything that has been used as a big data tool before. Nothing seems to compare to the speed and the amount of data that can fit into BigQuery. Data views are possible and recommended with basic data visualization tools.
This product typically comes at the end of the Big Data pipeline. It is not a replacement for existing technologies but it complements them. Real-time streams representing sensor data, web server logs, or social media graphs can be ingested into BigQuery to be queried in real time. After running ETL jobs on a traditional RDBMS, the resultant data set can be stored in BigQuery. Data can be ingested from data sets stored in Google Cloud Storage, through direct file import, or through streaming data.
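For instance, a minimal sketch of streaming ingestion with the Python client (the table and payload are illustrative):

from google.cloud import bigquery

client = bigquery.Client()
rows = [
    {"sensor_id": "s-42", "reading": 21.7, "ts": "2021-06-01T12:00:00Z"},
]
# Streamed rows become queryable within seconds of the insert.
errors = client.insert_rows_json("my_dataset.sensor_readings", rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")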
I recommend having a look at the book Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale, which includes a walkthrough of how to use the service and a deep dive into how it works.
More than that, I found a really interesting Medium article for Data Engineers new to BigQuery, where you can find considerations regarding DDL and the UI, as well as best practices.
I hope you find the above pieces of information useful.