How to send sensor data (like temperature data from a DHT11 sensor) to Google Cloud IoT Core and store it - google-cloud-platform

I am working on connecting a Raspberry Pi (3B+) to Google Cloud and sending sensor data to Google IoT Core, but I couldn't find any content on this. I would be very thankful if anyone could help me with it.
PS: I have already followed the interactive tutorial from Google Cloud itself and connected a simulated virtual device to the cloud and sent data. I am really looking for a tutorial that helps me connect a physical Raspberry Pi.
Thank you

You may want to try following along with this community article covering pretty much exactly what you're asking.
The article covers the following steps:
Creating a registry for your gateway device (the Raspberry Pi)
Adding a temperature / humidity sensor
Adding a light
Connecting the devices to Cloud IoT Core
Sending the data from the sensors to Google Cloud
Pulling the data back using PubSub
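To give a concrete picture of the device side before you dive in, below is a minimal Python sketch of the usual Cloud IoT Core MQTT pattern (a JWT signed with the device's private key used as the MQTT password, then DHT11 readings published as telemetry). The project, region, registry and device IDs, key file, CA file, and GPIO pin are placeholders you would replace with your own values, and it assumes the paho-mqtt, PyJWT, and Adafruit_DHT packages are installed on the Pi.

```python
import datetime, json, ssl, time

import jwt                      # PyJWT, signs the device JWT
import paho.mqtt.client as mqtt # MQTT client used by most IoT Core samples
import Adafruit_DHT             # reads the DHT11 on the Raspberry Pi

# --- placeholders: replace with your own project/registry/device values ---
PROJECT_ID, REGION = "my-project", "us-central1"
REGISTRY_ID, DEVICE_ID = "my-registry", "my-rpi"
PRIVATE_KEY_FILE, DHT_PIN = "rsa_private.pem", 4

def create_jwt():
    """Build the short-lived JWT that IoT Core expects as the MQTT password."""
    token = {
        "iat": datetime.datetime.utcnow(),
        "exp": datetime.datetime.utcnow() + datetime.timedelta(minutes=60),
        "aud": PROJECT_ID,
    }
    with open(PRIVATE_KEY_FILE, "r") as f:
        return jwt.encode(token, f.read(), algorithm="RS256")

client_id = (f"projects/{PROJECT_ID}/locations/{REGION}"
             f"/registries/{REGISTRY_ID}/devices/{DEVICE_ID}")
client = mqtt.Client(client_id=client_id)
client.username_pw_set(username="unused", password=create_jwt())
client.tls_set(ca_certs="roots.pem", tls_version=ssl.PROTOCOL_TLSv1_2)
client.connect("mqtt.googleapis.com", 8883)
client.loop_start()

# Telemetry goes to /devices/<id>/events, which IoT Core forwards to Pub/Sub.
topic = f"/devices/{DEVICE_ID}/events"
while True:
    humidity, temperature = Adafruit_DHT.read_retry(Adafruit_DHT.DHT11, DHT_PIN)
    if temperature is not None:
        payload = json.dumps({"temperature": temperature, "humidity": humidity})
        client.publish(topic, payload, qos=1)
    time.sleep(60)
```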

Create a registry in Google Cloud IoT Core and set up your devices and their public/private key pairs.
You will also have to set up Pub/Sub topics for publishing device telemetry and state events while creating the IoT Core registry.
Once that is done, you can create a streaming pipeline in Cloud Dataflow that reads data from the Pub/Sub subscription and sinks it into BigQuery (a relational data warehouse) or Bigtable (a NoSQL wide-column store).
Dataflow is a managed service for Apache Beam where you can create and deploy pipelines written in Java or Python.
If you are not familiar with coding, you can use Data Fusion, which helps you build your ETLs with drag-and-drop functionality, similar to Talend.
You can create a Data Fusion instance to build the streaming ETL pipeline. The source will be Pub/Sub and the sink will be BigQuery or Bigtable, depending on your use case.

For reference:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
This link will guide you through deploying the Google-provided Dataflow template from Pub/Sub to BigQuery.
For your own custom pipeline, you can take help from the GitHub link to the template pipeline code.
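If you go the custom-pipeline route, a minimal Apache Beam (Python) streaming sketch for the Pub/Sub-to-BigQuery leg might look like the following; the topic, table, and schema fields are placeholders for illustration, and you would run it on Dataflow by passing the usual DataflowRunner pipeline options.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# --- placeholders: replace with your own topic and table ---
INPUT_TOPIC = "projects/my-project/topics/device-telemetry"
OUTPUT_TABLE = "my-project:iot_dataset.sensor_readings"

def run():
    # streaming=True keeps the pipeline running and reading from Pub/Sub.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                OUTPUT_TABLE,
                schema="device_id:STRING,temperature:FLOAT,humidity:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```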

Related

Ingest RDBMS data to BigQuery

We have on-prem sources like SQL Server and Oracle, and data from them has to be ingested periodically, in batch mode, into BigQuery. What should the architecture be? Which GCP-native services can be used for this? Can Dataflow or Dataproc be used?
PS: Our organization hasn't licensed any third-party ETL tool so far. Preference is for Google-native services; Data Fusion is very expensive.
There are two approaches you can take with Apache Beam.
Periodically run a Beam/Dataflow batch job on your database. You could use Beam's JdbcIO connector to read data. After that you can transform your data using Beam transforms (PTransforms) and write to the destination using a Beam sink. In this approach, you are responsible for handling duplicate data (for example, by providing different SQL queries across executions).
Use a Beam/Dataflow pipeline that can read change streams from a database. The simplest approach here might be using one of the available Dataflow templates. For example, see here. You can also develop your own pipeline using Beam's DebeziumIO connector.
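For the first approach, a rough batch sketch in Python could look like the one below. It uses Beam's cross-language ReadFromJdbc wrapper around JdbcIO (which needs a Java runtime available to expand the transform); the JDBC URL, credentials, query, and table names are placeholders, and the target BigQuery table is assumed to already exist.

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# --- placeholders: replace with your own connection and table details ---
JDBC_URL = "jdbc:sqlserver://onprem-host:1433;databaseName=sales"
OUTPUT_TABLE = "my-project:staging.orders"

def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadFromSqlServer" >> ReadFromJdbc(
                table_name="orders",
                driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                jdbc_url=JDBC_URL,
                username="beam_reader",
                password="change-me",
                # Re-reading only recent rows is one simple way to limit duplicates.
                query="SELECT id, amount, updated_at FROM orders "
                      "WHERE updated_at > DATEADD(day, -1, GETDATE())",
            )
            # JdbcIO yields schema'd rows; convert them to dicts for BigQuery.
            | "ToDict" >> beam.Map(lambda row: row._asdict())
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                OUTPUT_TABLE,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

if __name__ == "__main__":
    run()
```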

Sending Data From Elasticsearch to AWS Databases in Real Time

I know this is a very different use case for Elasticsearch and I need your help.
Main structure (can't be changed):
There are some physical machines and we have sensors on them. Data from these sensors goes to AWS Greengrass.
Then, with a Lambda function, the data goes to Elasticsearch using MQTT. Elasticsearch is running in Docker.
This is the structure, and up to this point everything is ready and running ✅
Now, on top of ES, I need some software that can send this data over MQTT to a cloud database, for example DynamoDB.
But this is not a one-time migration; it should send the data continuously. Basically, I need a channel between ES and AWS DynamoDB.
Also, the sensors produce a lot of data and we don't want to store all of it in the cloud, only in ES. Some filtering is needed on the Elasticsearch side before we send data to the cloud, like "save every 10th reading to the cloud" so we only keep 1 record out of 10.
Do you have any idea how this can be done? I have no experience in this field and it looks like a challenging task. I would love to get suggestions from people experienced in these areas.
Thanks a lot! 🙌😊
I haven't worked on a similar use case, but you can try looking into Logstash for this.
It's an open-source service, part of the ELK stack, and provides the option of filtering the output. The pipeline will look something like this:
data ----> ES ----> Logstash ----> DynamoDB or any other destination.
It supports various plugins required for your use case, like:
DynamoDB output plugin -
https://github.com/tellapart/logstash-output-dynamodb
Logstash MQTT Output Plugin -
https://github.com/kompa3/logstash-output-mqtt
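If those plugins don't fit (the DynamoDB output plugin above is fairly old), an alternative to Logstash is a small custom bridge. Below is a hedged Python sketch, under the assumption that documents carry timestamp, device_id, and temperature fields, that polls Elasticsearch and forwards every 10th document to DynamoDB using the official elasticsearch and boto3 clients; the index, table, region, and field names are placeholders.

```python
import time

import boto3                        # AWS SDK, used for the DynamoDB writes
from elasticsearch import Elasticsearch

# --- placeholders: adjust index, table, region, and field names ---
es = Elasticsearch("http://localhost:9200")
table = boto3.resource("dynamodb", region_name="eu-west-1").Table("sensor_data")

last_ts = 0          # timestamp of the last document we forwarded
counter = 0          # drives the "keep 1 out of 10" filter

while True:
    # Pull documents newer than the last one we saw, oldest first.
    resp = es.search(
        index="sensor-readings",
        body={
            "query": {"range": {"timestamp": {"gt": last_ts}}},
            "sort": [{"timestamp": "asc"}],
            "size": 1000,
        },
    )
    for hit in resp["hits"]["hits"]:
        doc = hit["_source"]
        last_ts = doc["timestamp"]
        counter += 1
        if counter % 10 == 0:            # forward only every 10th reading
            table.put_item(Item={
                "device_id": doc["device_id"],
                "timestamp": int(doc["timestamp"]),     # DynamoDB wants int/Decimal, not float
                "temperature": str(doc["temperature"]), # stored as a string to avoid float issues
            })
    time.sleep(30)
```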

What is the best way to feed data from Oracle Golden Gate to Big Query

I am looking to stream data into BigQuery from Oracle GoldenGate (OGG). What is the best way to do it? Can data from Oracle GoldenGate be streamed directly into BigQuery, or do we need some connectors?
Please suggest the best possible way to do it.
You need a connector to bridge the two worlds:
Either a bridge that converts OGG events into Pub/Sub messages, and then processes the Pub/Sub messages in streaming mode (with Dataflow, or with Cloud Run/Cloud Functions if the data volume and rate are acceptable).
Or a compute resource (on-prem or on GCP) that listens to OGG and, for each message, streams the data into BigQuery with a streaming write.
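As an illustrative sketch of the second option (a listener that stream-writes into BigQuery), the Python snippet below consumes JSON events from a Pub/Sub subscription and writes them with the BigQuery streaming-insert API; the subscription path, table ID, and the assumption that each message is one JSON row matching the table schema are placeholders for illustration.

```python
import json

from google.cloud import bigquery, pubsub_v1

# --- placeholders: replace with your own subscription and table ---
SUBSCRIPTION = "projects/my-project/subscriptions/ogg-events-sub"
TABLE_ID = "my-project.replica_dataset.orders"

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()

def callback(message):
    # Each message is assumed to be one JSON-encoded change event.
    row = json.loads(message.data.decode("utf-8"))
    errors = bq.insert_rows_json(TABLE_ID, [row])  # streaming insert
    if not errors:
        message.ack()
    else:
        message.nack()  # let Pub/Sub redeliver on failure

streaming_pull = subscriber.subscribe(SUBSCRIPTION, callback=callback)
print("Listening for OGG change events...")
try:
    streaming_pull.result()   # block until cancelled or an error occurs
except KeyboardInterrupt:
    streaming_pull.cancel()
```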
As a heads-up, the GoldenGate product is natively certified to deliver data into Google BigQuery; the full list of big data certifications for GG 19c is here: https://www.oracle.com/technetwork/middleware/ogg-19-1-0-0-0-cert-matrix-5491855.xls
(Note: you can replicate data from Oracle DB 11.2.0.4 and higher, using OGG 12.3 and higher, into GoldenGate for Big Data 19c for BigQuery.)
Oracle GoldenGate has a native adapter to deliver to Google BigQuery. It has been on the market since September 2018 and is used successfully by many companies.
Refer to https://blogs.oracle.com/dataintegration/goldengate-for-big-data-123211-release-update
I think Google Pub/Sub is not the right interface for delivering large volumes of data into the cloud; Pub/Sub has throughput limitations.
See this link: https://cloud.google.com/pubsub/quotas.
Please note that GoldenGate does not have hard throughput boundaries the way Google Pub/Sub or AWS Kinesis do; GoldenGate is a truly scalable product.
Failure cases when using Google Pub/Sub could be as follows:
What happens if the input rate is more than 5 MB/sec? Pub/Sub will reject the data, and how should that failure be handled?
One might argue that they will create multiple topics, but what if there is a single main table producing more than 5 MB/sec and you don't want to segregate it into different topics?

Planning an architecture in GCP

I want to plan an architecture based on the GCP cloud platform. Below are the subject areas I have to cover. Can someone please help me find the proper services to perform these operations?
Data ingestion (Batch, Real-time, Scheduler)
Data profiling
AI/ML based data processing
Analytical data processing
Elasticsearch
User interface
Batch and Real-time publish
Security
Logging/Audit
Monitoring
Code repository
If I am missing something that I have to take care of, please add that too.
GCP offers many products whose functionality can partially overlap. Which product to use depends on the more specific use case, and you can find an overview here.
That being said, an overall summary of the services you asked about would be:
1. Data ingestion (Batch, Real-time, Scheduler)
That will depend on where your data comes from, but the most common options are Dataflow (both for batch and streaming) and Pub/Sub for streaming messages.
2. Data profiling
Dataprep (which actually runs on top of Dataflow) can be used for data profiling, here is an overview of how you can do it.
3. AI/ML based data processing
For this, you have several options depending on your needs. For developers with limited machine learning expertise there is AutoML, which allows you to quickly train and deploy models. For more experienced data scientists there is ML Engine, which allows training and prediction of custom models built with frameworks like TensorFlow or scikit-learn.
Additionally, there are some pre-trained models for things like video analysis, computer vision, speech-to-text, speech synthesis, natural language processing, and translation.
Plus, it's even possible to perform some ML tasks in SQL directly in GCP's data warehouse, BigQuery.
4. Analytical data processing
Depending on your needs, you can use Dataproc, which is a managed Hadoop and Spark service, or Dataflow for stream and batch data processing.
BigQuery is also designed with analytical operations in mind.
5. Elasticsearch
There is no managed Elasticsearch service directly provided by GCP, but you can find several options on the Marketplace, like an API service or a Kubernetes app for Google Kubernetes Engine.
6. User interface
If you are referring to a user interface for your own use, GCP’s console is what you’d be using. If you are referring to a UI for end-users, I’d suggest using App Engine.
If you are referring to a UI for data exploration, there is Datalab, which is essentially a managed notebook service, and Data Studio, where you can build plots of your data in real time.
7. Batch and Real-time publish
The publishing service in GCP, for both synchronous and asynchronous messages, is Pub/Sub (a minimal publishing sketch follows this list).
8. Security
Most security concerns in GCP are addressed here; this is a wide topic by itself and would probably warrant a separate question.
9. Logging/Audit
GCP uses Stackdriver for logging of most of its products, and provides many ways to process and analyze those logs.
10. Monitoring
Stackdriver also has monitoring features.
11. Code repository
For this there is Cloud Source Repositories, which integrates with GCP's automated build system and can also be easily synced with a GitHub repository.
12. Analytical data warehouse
You did not ask for this one, but I think it's an important part of a data analysis stack.
In the case of GCP, this would be BigQuery.
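For item 7 above, a minimal sketch of publishing a message with the Pub/Sub client library for Python might look like this (the project, topic, and payload are placeholders):

```python
import json

from google.cloud import pubsub_v1

# --- placeholders: replace with your own project and topic ---
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")

# Messages are bytes; extra keyword arguments become message attributes.
payload = json.dumps({"source": "batch-job", "status": "done"}).encode("utf-8")
future = publisher.publish(topic_path, payload, origin="example")
print("Published message id:", future.result())  # blocks until the publish completes
```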

Process data on Firebase of Google Cloud platform

I hope to upload data from an Android app to Google Cloud Platform and do some basic machine learning/statistical operations. I have used Firebase and uploaded the data generated by the Android app to the Realtime Database on Firebase. My next goal is to do some data processing, such as simple statistics and machine learning operations, and I do not know whether the Realtime Database can support these operations. If not, it seems Google Cloud Platform can do such operations in MySQL; how do I transfer the data from the Realtime Database on Firebase to MySQL? I am new to GCP and hope to get a clear direction. Thank you.
You can use the firebase_admin library to access Realtime Database data. Then you can either store it using one of the many Google Cloud client libraries or use it directly in an ML job.
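As a minimal sketch of that suggestion (the service-account file, database URL, node path, and bucket name are placeholders, and exporting to a Cloud Storage bucket is just one of the client-library options mentioned):

```python
import json

import firebase_admin
from firebase_admin import credentials, db
from google.cloud import storage

# --- placeholders: replace with your service account, database URL, and paths ---
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://my-app.firebaseio.com"})

# Read everything under /sensor_readings from the Realtime Database.
readings = db.reference("sensor_readings").get() or {}

# Hand the snapshot to another GCP service; here it is dumped to a GCS bucket
# so a training job (or a BigQuery load) can pick it up later.
bucket = storage.Client().bucket("my-ml-staging-bucket")
bucket.blob("exports/sensor_readings.json").upload_from_string(
    json.dumps(readings), content_type="application/json"
)
print(f"Exported {len(readings)} records")
```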