What is the best way to feed data from Oracle GoldenGate to BigQuery?

I am looking to stream data into BigQuery from Oracle GoldenGate (OGG). Can data from OGG be streamed directly into BigQuery, or do we need a connector?
Please suggest the best possible way to do this.

You need a connector to bridge the two worlds. Either:
a bridge that converts OGG events into Pub/Sub messages, which you then process in streaming mode (with Dataflow, or with Cloud Run/Cloud Functions if the data volume and rate are acceptable);
or a process (on-prem or on GCP) that listens to OGG and, for each message, streams the data into BigQuery.
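For the second option, here is a minimal sketch in Python, assuming each OGG change event has already been parsed into a dict and that a destination table such as my-project.my_dataset.orders already exists with matching columns:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.orders"  # assumed destination table

def write_change_event(event: dict) -> None:
    """Stream one parsed OGG change event into BigQuery."""
    errors = client.insert_rows_json(table_id, [event])
    if errors:
        # In production you would retry these rows or send them to a dead-letter location.
        raise RuntimeError(f"BigQuery insert errors: {errors}")

# Example with a fabricated event payload:
write_change_event({"order_id": 42, "op_type": "I", "amount": 19.99})
```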

As a heads-up, the GoldenGate product is natively certified to deliver data into Google BigQuery; the full list of Big Data certifications for GoldenGate 19c is here: https://www.oracle.com/technetwork/middleware/ogg-19-1-0-0-0-cert-matrix-5491855.xls
(Note: you can replicate data from Oracle DB 11.2.0.4 and higher, using OGG 12.3 and higher, into GoldenGate for Big Data 19c for BigQuery.)

Oracle GoldenGate has a native adapter to deliver to Google BigQuery. It has been on the market since September 2018 and is used successfully by many companies.
Refer to https://blogs.oracle.com/dataintegration/goldengate-for-big-data-123211-release-update
I think Google Pub/Sub is not the right interface to deliver large volumes of data into the cloud: Pub/Sub has limitations on throughput.
See this link: https://cloud.google.com/pubsub/quotas.
Please note that GoldenGate does not have hard limits of that kind, unlike Google Pub/Sub or AWS Kinesis; GoldenGate is a truly scalable product.
Failure scenarios when using Google Pub/Sub could be as follows:
What happens if the input rate exceeds 5 MB/sec? Pub/Sub will reject the data, and how should that failure be handled?
One might argue that you could create multiple topics, but what if a single main table produces more than 5 MB/sec and you don't want to split it across different topics?
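To illustrate the failure-handling question above: when publishing with the Python client, a rejected message surfaces as an exception on the publish future, and that is where you would retry, buffer, or dead-letter. A hedged sketch, with illustrative batch settings and made-up project/topic names:

```python
from google.cloud import pubsub_v1

# Client-side batching; these settings are illustrative only.
batch_settings = pubsub_v1.types.BatchSettings(max_bytes=1_000_000, max_latency=0.05)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path("my-project", "ogg-changes")  # assumed names

def on_publish_done(future):
    try:
        future.result()  # raises if the publish was rejected (e.g. quota exceeded)
    except Exception as exc:
        # This is the failure point discussed above: retry with backoff,
        # buffer locally, or route the event to a dead-letter location.
        print(f"Publish failed and needs handling: {exc}")

data = b'{"order_id": 42, "op_type": "I"}'
publisher.publish(topic_path, data).add_done_callback(on_publish_done)
```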

Related

Can Google Dataflow connect to an API data source and insert data into BigQuery?

We are exploring a few use cases where we might have to ingest data generated by SCADA/PIMS devices.
For security reasons, we are not allowed to connect directly to the OT devices or data sources; instead, the data is exposed through REST APIs that can be used to consume it.
Please suggest whether Dataflow or any other GCP service can be used to capture this data and put it into BigQuery or any other relevant target service.
If possible, please share any relevant documentation/links around such requirements.
Yes!
Here is what you need to know: when you write an Apache Beam pipeline, your processing logic lives in the DoFns that you create. These functions can call any logic you want. If your data source is unbounded or just large, you will author a "splittable DoFn" that can be read by multiple worker machines in parallel and checkpointed. You will need to figure out how to achieve exactly-once ingestion from your REST API and how to avoid overwhelming your service; that is usually the hardest part.
That said, you may wish to use a different approach, such as pushing the data into Cloud Pubsub first. Then you would use Cloud Dataflow to read the data from Cloud Pubsub. This will provide a natural scalable queue between your devices and your data processing.
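A minimal sketch of the first approach, using a plain (non-splittable) DoFn that pages through a hypothetical REST endpoint and writes to an existing BigQuery table; the endpoint, paging scheme, and table name are all assumptions:

```python
import apache_beam as beam
import requests

API_URL = "https://scada-gateway.example.com/readings"  # hypothetical endpoint
TABLE = "my-project:iot.readings"                       # assumed existing table

class FetchPage(beam.DoFn):
    """Fetch one page of readings; for a large or unbounded source you would
    implement a splittable DoFn instead, as described above."""
    def process(self, page_number):
        resp = requests.get(API_URL, params={"page": page_number}, timeout=30)
        resp.raise_for_status()
        # Assumes the API returns a JSON list of objects matching the table columns.
        for record in resp.json():
            yield record

with beam.Pipeline() as p:
    (p
     | "Pages" >> beam.Create(list(range(10)))  # known, bounded page range
     | "Fetch" >> beam.ParDo(FetchPage())
     | "Write" >> beam.io.WriteToBigQuery(
           TABLE,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```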
You can capture data with Pub/Sub and direct it to be processed in Dataflow and then saved into BigQuery (or Cloud Storage), using the corresponding IO connector.
Stream messages from Pub/Sub by using Dataflow:
https://cloud.google.com/pubsub/docs/stream-messages-dataflow
Google-provided streaming templates (for Dataflow): PubSub->Dataflow->BigQuery:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
Whole solution:
https://medium.com/codex/a-dataflow-journey-from-pubsub-to-bigquery-68eb3270c93
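A minimal sketch of such a Pub/Sub -> Dataflow -> BigQuery pipeline with the Beam Python SDK; the subscription name, table name, and schema are assumptions:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/my-project/subscriptions/telemetry-sub"  # assumed
TABLE = "my-project:analytics.telemetry"                          # assumed

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
           TABLE,
           schema="device_id:STRING,temperature:FLOAT,ts:TIMESTAMP",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```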

Best way to ingest data into BigQuery

I have heterogeneous sources: flat files residing on-prem, JSON on SharePoint, an API which serves data, and so on. Which is the best ETL tool to bring the data into the BigQuery environment?
I'm a kindergarten student in GCP :)
Thanks in advance
There are many solutions to achieve this. It depends on several factors, some of which are:
frequency of data ingestion
whether or not the data needs to be manipulated before being written into BigQuery (your files may not be formatted correctly)
whether this is going to be done manually or automated
size of the data being written
If you are just looking for an ETL tool you can find many. If you plan to scale this to many pipelines you might want to look at a more advanced tool like Airflow, but if you just have a few one-off processes you could set up a Cloud Function within GCP to accomplish this. You can schedule it (via cron), invoke it through an HTTP endpoint, or trigger it from Pub/Sub. You can see an example of how this is done here.
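A minimal sketch of that Cloud Function approach: a function triggered when a file lands in a Cloud Storage bucket, which loads the file into BigQuery. The bucket, table, and CSV format are assumptions:

```python
from google.cloud import bigquery

def load_to_bigquery(event, context):
    """Background Cloud Function triggered by a GCS object finalize event."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumes CSV files
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        uri, "my-project.staging.raw_files", job_config=job_config)
    load_job.result()  # wait for completion so errors surface in the function logs
```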
After several tries and data lake/data warehouse designs and architectures, I can recommend only one thing: ingest your data into BigQuery as soon as possible, no matter the format or transformation.
Then, in BigQuery, run queries to format, clean, aggregate, and add value to your data. It's not ETL, it's ELT: you start by loading your data and then you transform it.
It's quicker, cheaper, simpler, and based only on SQL.
It works only if you use ONLY BigQuery as the destination.
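A minimal sketch of that ELT pattern: the raw data has already been loaded into an assumed staging table, and the transformation is plain SQL executed inside BigQuery (here driven from Python):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT DATE(order_ts) AS day,
       SUM(amount)    AS total_amount,
       COUNT(*)       AS orders
FROM   staging.raw_orders           -- assumed staging table, loaded as-is
WHERE  amount IS NOT NULL           -- cleaning happens here, after the load
GROUP  BY day
"""
client.query(sql).result()  # runs entirely inside BigQuery: the "T" of ELT
```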
If you are starting from scratch and have no legacy tools to carry with you, the following GCP managed products target your use case:
Cloud Data Fusion, "a fully managed, code-free data integration service that helps users efficiently build and manage ETL/ELT data pipelines"
Cloud Composer, "a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines"
Dataflow, "a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing"
(Without considering a myriad of data integration tools and fully customized solutions using Cloud Run, Scheduler, Workflows, VMs, etc.)
Choosing one depends on your technical skills, real-time processing needs, and budget. As mentioned by Guillaume Blaquiere, if BigQuery is your only destination, you should try to leverage BigQuery's processing power on your data transformation.

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're discussing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there's more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at top of page)
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
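As a rough illustration of that batch case with the Beam Python SDK, assuming the on-premise database is PostgreSQL and is reachable from the Dataflow workers (the query, connection string, and table are all assumptions):

```python
import apache_beam as beam
import psycopg2  # assumes a PostgreSQL target; the driver must be installed on the workers

QUERY = "SELECT id, name, updated_at FROM `my-project.analytics.customers`"
DB_DSN = "host=onprem-db.internal dbname=crm user=etl password=secret"  # assumed

class WriteToOnPremDB(beam.DoFn):
    """Buffer the rows of each bundle and write them to the on-prem database."""
    def start_bundle(self):
        self.conn = psycopg2.connect(DB_DSN)
        self.rows = []

    def process(self, row):
        self.rows.append((row["id"], row["name"], row["updated_at"]))

    def finish_bundle(self):
        with self.conn, self.conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO customers (id, name, updated_at) VALUES (%s, %s, %s)",
                self.rows)
        self.conn.close()

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(query=QUERY, use_standard_sql=True)
     | "Write" >> beam.ParDo(WriteToOnPremDB()))
```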

How to send sensor data (like temperature data from DHT11 sensor) to Google Cloud IoT Core and store it

I am working on connecting a Raspberry Pi (3B+) to Google Cloud and sending sensor data to Google IoT Core, but I couldn't find any content on this matter. I would be thankful if anyone could help me with this.
PS: I have already followed the interactive tutorial from Google Cloud itself and connected a simulated virtual device to the cloud and sent data. I am really looking for a tutorial that helps me connect a physical Raspberry Pi.
Thank you
You may want to try following along with this community article covering pretty much exactly what you're asking.
The article covers the following steps:
Creating a registry for your gateway device (the Raspberry Pi)
Adding a temperature / humidity sensor
Adding a light
Connecting the devices to Cloud IoT Core
Sending the data from the sensors to Google Cloud
Pulling the data back using PubSub
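For the last step in that list, here is a minimal sketch of pulling the telemetry back out of Pub/Sub with the Python client; the project and subscription names are assumptions:

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "dht11-telemetry-sub")

def callback(message):
    print(f"Received telemetry: {message.data.decode('utf-8')}")
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=60)  # listen for one minute
except TimeoutError:
    streaming_pull_future.cancel()
```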
Create a registry in Google Cloud IoT Core and set up devices and their public/private key pairs.
You will also have to set up Pub/Sub topics for publishing device telemetry and state events while creating the IoT Core registry.
Once that is done, you can create a streaming pipeline in Cloud Dataflow that will read data from the Pub/Sub subscription and sink it into BigQuery (a relational data warehouse) or Bigtable (a NoSQL database).
Dataflow is a managed service for Apache Beam, where you can create and deploy pipelines written in Java or Python.
If you are not familiar with coding, you can use Data Fusion, which helps you write your ETLs using drag-and-drop functionality, similar to Talend.
You can create a Data Fusion instance in order to build a streaming ETL pipeline. The source will be Pub/Sub and the sink will be BigQuery or Bigtable, based on your use case.
For reference:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
This link will guide you through deploying the Google-provided Dataflow template from Pub/Sub to BigQuery.
For your own custom pipeline, you can take help from the GitHub repository of the pipeline code.
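If you prefer to launch that provided template programmatically rather than from the console, here is a hedged sketch using the Dataflow REST API through the Python API client; the project, region, topic, and table names are assumptions, and the template path/parameters are those of the classic Pub/Sub Topic to BigQuery template:

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
    body={
        "jobName": "pubsub-to-bq-telemetry",
        "parameters": {
            "inputTopic": "projects/my-project/topics/device-telemetry",
            "outputTableSpec": "my-project:iot.telemetry",
        },
    },
).execute()
print(response)
```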

Google Dataprep integration with message brokers

Is it possible to read from Kafka or Google Pub/Sub in a Dataprep job?
If so, are there any 'best practice' deployment considerations to expect when the samples are edited in an "oh so snappy, live and responsive" Visual Studio-like tool (minus the ability to purchase or download it), whereas debugging the production flow (the same "type" of data) is performed on top of anything but such tools (coding Scala/Java in our favorite IDE)?
There is not a native way to read from a message system, like Kafka or Pub/Sub, directly into Cloud Dataprep.
I'd recommend an alternative approach:
Stream the data into BigQuery and then read the data from BQ
Write the streamed data to Cloud Storage and then load it from there
Both approaches require writing the data to an intermediate location first. I'd recommend BQ if you need low latency, performance, or queryability in the future. I'd recommend GCS for low cost when speed is not critical.
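A minimal sketch of the second approach: synchronously pulling a batch of Pub/Sub messages and writing them as a newline-delimited JSON object to Cloud Storage, where Dataprep can pick them up. The bucket, subscription, and object names are assumptions:

```python
from google.cloud import pubsub_v1, storage

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "events-sub")

# Pull up to 500 messages in one synchronous request.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 500})

lines, ack_ids = [], []
for received in response.received_messages:
    lines.append(received.message.data.decode("utf-8"))
    ack_ids.append(received.ack_id)

if lines:
    # Write the batch as newline-delimited JSON, and only then acknowledge it.
    bucket = storage.Client().bucket("my-dataprep-staging")
    blob = bucket.blob("incoming/events-batch-0001.json")
    blob.upload_from_string("\n".join(lines), content_type="application/json")
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})
```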