Ingest RDBMS data to BigQuery - google-cloud-platform

If we have an on-prem sources like SQL-Server and Oracle. Data from it has to be ingested periodically in batch mode in Big Query. What shud be the architecture? Which GCP native services can be used for this? Can Dataflow or DataProc be used?
PS: Our organization haven't licensed any third-party ETL tool so far. Preference is for google native service. Data Fusion is very expensive.

There are two approaches you can take with Apache Beam.
Periodically run a Beam/Dataflow batch job on your database. You could use Beam's JdbcIO connector to read data. After that you can transform your data using Beam transforms (PTransforms) and write to the destination using a Beam sink. In this approach, you are responsible for handling duplicate data (for example, by providing different SQL queries across executions).
Use a Beam/Dataflow pipeline that can read change streams from a database. The simplest approach here might be using one of the available Dataflow templates. For example, see here. You can also develop your own pipeline using Beam's DebeziumIO connector.

Related

Best way to ingest data to bigquery

I have heterogeneous sources like flat files residing on prem, json on share point, api which serves data so and so. Which is the best etl tool to bring data to bigquery environment ?
Im a kinder garden student in GCP :)
Thanks in advance
There are many solutions to achieve this. It depends on several factors some of which are:
frequency of data ingestion
whether or not the data needs to be
manipulated before being written into bigquery (your files may not
be formatted correctly)
is this going to be done manually or is this going to be automated
size of the data being written
If you are just looking for an ETL tool you can find many. If you plan to scale this to many pipelines you might want to look at a more advanced tool like Airflow but if you just have a few one-off processes you could set up a Cloud Function within GCP to accomplish this. You can schedule it (via cron), invoke it through HTTP endpoint, or pub/sub. You can see an example of how this is done here
After several tries and datalake/datawarehouse design and architecture, I can recommend you only 1 thing: ingest your data as soon as possible in BigQuery; no matter the format/transformation.
Then, in BigQuery, perform query to format, clean, aggregate, value your data. It's not ETL, it's ELT: you start by loading your data and then you transform them.
It's quicker, cheaper, simpler, and only based on SQL.
It works only if you use ONLY BigQuery as destination.
If you are starting from scratch and have no legacy tools to carry with you, the following GCP managed products target your use case:
Cloud Data Fusion, "a fully managed, code-free data integration service that helps users efficiently build and manage ETL/ELT data pipelines"
Cloud Composer, "a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines"
Dataflow, "a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing"
(Without considering a myriad of data integration tools and fully customized solutions using Cloud Run, Scheduler, Workflows, VMs, etc.)
Choosing one depends on your technical skills, real-time processing needs, and budget. As mentioned by Guillaume Blaquiere, if BigQuery is your only destination, you should try to leverage BigQuery's processing power on your data transformation.

Updating all entities of KIND in Google Cloud Datastore

we have a dataset of ~10 million entities or a certain Kind in Datastore. We want to change the products functionality, so we would like to change the fields on all Kind entities.
Is there a smart/quick way to do it, that does not involve iterating over all of the entities in series?
Probably you can use Dataflow to help you with your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
It was open sourced in the Apache Beam project. It is fully compatible with this SDK. This allows you to test your developments locally before run them on GCP.
It exposes two main concepts, a PCollection, basically the data that is being handled by the tool, and pipelines, the different steps necessary to capture the data, the transformations that must be performed, and how and where the results obtained should be written.
It provides support for Java, Python and Go, and a rich feature set and variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow provides support for read, write and delete data. See for instance the relevant documentation for Python.
You can see a good example of how to interact with datastore in the Apache Beam Github repository.
These two other articles could be also interesting: 1 2.
I would presume that you have to loop through each one and update it as it's a NoSQL data store like mongo from what I can see. We have a system that uses SQL and Mongo and the demoralised data is a pain, we had to write migrations that would loop through all and update.

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and syncs and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're discussing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there's more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at top of page)
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.

Can Google Data Fusion make the same data cleaning than DataPrep?

I want to run a machine learning model with some data. Before train the model with this data I need to process it, so I have been reading some ways to do it.
First of all create a Dataflow pipeline to upload it to Bigquery or Google Cloud Storage, then create a data pipeline with Google Dataprep to clean it.
The other way I reat to do it is with Data Fusion, that can create data pipelines more easier, but I don't know and here is my doubt, data Fusion it is only to create a pipeline like Dataflow and then I have to use DataPrep to clean the data or if Data Fusion can clean the data and prepare it to put into my machine learning model.
If Data Fusion can clean the data as DataPrep, when I should use DataPrep?
Datafusion and Dataprep can perform the same things. However their execution are different.
Datafusion create a Spark pipeline and run it on Dataproc cluster
Dataprep create a Beam pipeline and run it on Dataflow
IMO, Datafusion is more designed for data ingestion from one source to another one, with few transformation.
Dataprep is more designed for data preparation (as its name means), data cleaning, new column creation, splitting column. Dataprep also provide insight of the data for helping you in your recipes.
In addition, Beam is a part of Tensorflow extended and your Data engineer pipeline will be more consistent if you use a tool compliant with Beam
That's why I will recommend Dataprep instead Datafusion.

Planning an architecture in GCP

I want to plan an architecture based on GCP cloud platform. Below are the subject areas what I have to cover. Can someone please help me to find out the proper services which will perform that operation?
Data ingestion (Batch, Real-time, Scheduler)
Data profiling
AI/ML based data processing
Analytical data processing
Elastic search
User interface
Batch and Real-time publish
Security
Logging/Audit
Monitoring
Code repository
If I am missing something which I have to take care then please add the same too.
GCP offers many products with functionality that can overlap partially. What product to use would depend on the more specific use case, and you can find an overview about it here.
That being said, an overall summary of the services you asked about would be:
1. Data ingestion (Batch, Real-time, Scheduler)
That will depend on where your data comes from, but the most common options are Dataflow (both for batch and streaming) and Pub/Sub for streaming messages.
2. Data profiling
Dataprep (which actually runs on top of Dataflow) can be used for data profiling, here is an overview of how you can do it.
3. AI/ML based data processing
For this, you have several options depending on your needs. For developers with limited machine learning expertise there is AutoML that allows to quickly train and deploy models. For more experienced data scientists there is ML Engine, that allows training and prediction of custom models made with frameworks like TensorFlow or scikit-learn.
Additionally, there are some pre-trained models for things like video analysis, computer vision, speech to text, speech synthesis, natural language processing or translation.
Plus, it’s even possible to perform some ML tasks in GCP’s data warehouse, BigQuery in SQL language.
4. Analytical data processing
Depending on your needs, you can use Dataproc, which is a managed Hadoop and Spark service, or Dataflow for stream and batch data processing.
BigQuery is also designed with analytical operations in mind.
5. Elastic search
There is no managed Elastic search service directly provided by GCP, but you can find several options on the marketplace, like an API service or a Kubernetes app for Google’s Kubernetes Engine.
6. User interface
If you are referring to a user interface for your own use, GCP’s console is what you’d be using. If you are referring to a UI for end-users, I’d suggest using App Engine.
If you are referring to a UI for data exploration, there is Datalab, which is essentially a managed notebook service, and Data Studio, where you can build plots of your data in real time.
7. Batch and Real-time publish
The publishing service in GCP, for both synchronous and asynchronous messages is Pub/Sub.
8. Security
Most security concerns in GCP are addressed here. Which is a wide topic by itself and should probably need a separate question.
9. Logging/Audit
GCP uses Stackdriver for logging of most of its products, and provides many ways to process and analyze those logs.
10. Monitoring
Stackdriver also has monitoring features.
11. Code repository
For this there is Cloud Source Repositories, which integrate with GCP’s automated build system and can also be easily synched with a Github repository.
12. Analytical data warehouse
You did not ask for this one, but I think it's an important part of a data analysis stack.
In the case of GCP, this would be BigQuery.