I know this is a very different use case for Elasticsearch and I need your help.
Main structure (can't be changed):
There are some physical machines with sensors on them. Data from these sensors go to AWS Greengrass.
Then a Lambda function sends the data to Elasticsearch over MQTT. Elasticsearch is running in Docker.
This is the structure, and up to this point everything is ready and running.
Now, on top of ES, I need some software that can send this data over MQTT to a cloud database, for example DynamoDB.
But this is not a one-time migration; it should send the data continuously. Basically, I need a channel between ES and AWS DynamoDB.
Also, the sensors produce a lot of data, and we don't want to store all of it in the cloud; we only want to store everything in ES. Some filtering is needed on the Elasticsearch side before we send data to the cloud, like "save every 10th reading to the cloud", so we only keep 1 reading out of 10 there.
Do you have any idea how this could be done? I have no experience in this field and it looks like a challenging task. I would love to get some suggestions from experienced people in these areas.
Thanks a lot!
I haven't worked on a similar use case, but you can try looking into Logstash for this.
It's an open source service, part of the ELK stack, and provides options for filtering the output. The pipeline will look something like the below:
data ----> ES ----> Logstash -----> DynamoDB or any other destination.
It supports various plugins required for your use case, like:
DynamoDB output plugin -
https://github.com/tellapart/logstash-output-dynamodb
Logstash MQTT Output Plugin -
https://github.com/kompa3/logstash-output-mqtt
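If those plugins don't work out (both are third-party and fairly old), another option is a small script that polls Elasticsearch and forwards only every 10th document to DynamoDB, skipping MQTT entirely. This is only a rough Python sketch, assuming the official elasticsearch client and boto3; the index name, table name, and timestamp field are placeholders for your setup:

```python
# Sketch only: poll Elasticsearch and forward every 10th new document to DynamoDB.
# Assumes an index "sensor-data" with a numeric "timestamp" field and a DynamoDB
# table "sensor_samples" -- adjust the names to your environment.
import json
import time
from decimal import Decimal

import boto3
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")                 # the ES container
table = boto3.resource("dynamodb").Table("sensor_samples")

SAMPLE_EVERY = 10    # keep 1 document out of 10
last_ts = 0
counter = 0

while True:
    # Fetch documents newer than the last one processed, oldest first.
    resp = es.search(
        index="sensor-data",
        body={
            "query": {"range": {"timestamp": {"gt": last_ts}}},
            "sort": [{"timestamp": "asc"}],
            "size": 500,
        },
    )
    for hit in resp["hits"]["hits"]:
        doc = hit["_source"]
        last_ts = doc["timestamp"]
        counter += 1
        if counter % SAMPLE_EVERY == 0:                     # the "save every 10th" filter
            # DynamoDB rejects Python floats, so convert numbers to Decimal.
            item = json.loads(json.dumps(doc), parse_float=Decimal)
            table.put_item(Item=item)
    time.sleep(5)
```

The modulo counter implements the "1 out of 10" rule; it could just as easily be swapped for a time-based sample (e.g. one document per minute) if that suits the sensors better.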
I am working on connecting a Raspberry Pi (3B+) to Google Cloud and sending sensor data to Google IoT Core, but I couldn't find any content on this. I would be very thankful if anyone could help me with it.
PS: I have already followed the interactive tutorial from Google Cloud itself, connected a simulated virtual device to the Cloud, and sent data. I am really looking for a tutorial that helps me connect a physical Raspberry Pi.
Thank you
You may want to try following along with this community article covering pretty much exactly what you're asking.
The article covers the following steps:
Creating a registry for your gateway device (the Raspberry Pi)
Adding a temperature / humidity sensor
Adding a light
Connecting the devices to Cloud IoT Core
Sending the data from the sensors to Google Cloud
Pulling the data back using PubSub
Create a registry in Google Cloud IoT Core and set up your devices and their public/private key pairs.
You will also have to set up Pub/Sub topics for publishing device telemetry and state events while creating the IoT Core registry.
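For the device side, once the registry, device, and key pair exist, the connection is a JWT-authenticated MQTT session against the Cloud IoT Core bridge. A rough Python sketch using paho-mqtt and PyJWT; the project, region, registry, and device names and the key paths are placeholders, and roots.pem is Google's root CA bundle, which has to be downloaded separately:

```python
# Rough sketch: publish one telemetry message from the Pi to Cloud IoT Core.
# Project/region/registry/device names and file paths below are placeholders.
import datetime
import json
import ssl

import jwt                       # PyJWT, used to sign the connection token
import paho.mqtt.client as mqtt

PROJECT, REGION = "my-project", "us-central1"
REGISTRY, DEVICE = "pi-registry", "pi-device"

def make_jwt():
    # Cloud IoT Core authenticates the device with a JWT signed by its private key.
    now = datetime.datetime.utcnow()
    claims = {"iat": now, "exp": now + datetime.timedelta(minutes=60), "aud": PROJECT}
    with open("rsa_private.pem", "r") as f:
        return jwt.encode(claims, f.read(), algorithm="RS256")

client_id = (f"projects/{PROJECT}/locations/{REGION}"
             f"/registries/{REGISTRY}/devices/{DEVICE}")
# paho-mqtt 1.x style constructor (2.x additionally expects a CallbackAPIVersion argument).
client = mqtt.Client(client_id=client_id)
# The bridge ignores the username; the JWT goes in the password field.
client.username_pw_set(username="unused", password=make_jwt())
client.tls_set(ca_certs="roots.pem", tls_version=ssl.PROTOCOL_TLSv1_2)

client.connect("mqtt.googleapis.com", 8883)
client.loop_start()

payload = json.dumps({"temperature": 21.5, "humidity": 40})
# Messages on /devices/<id>/events land on the telemetry Pub/Sub topic of the registry.
client.publish(f"/devices/{DEVICE}/events", payload, qos=1)
client.loop_stop()
```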
Once that is done, you can create a streaming pipeline in Cloud Dataflow that reads data from the Pub/Sub subscription and sinks it into BigQuery (a relational data warehouse) or Bigtable (a NoSQL store).
Dataflow is a managed service for Apache Beam, where you can create and deploy pipelines written in Java or Python.
If you are not familiar with coding, you can use Data Fusion, which lets you build your ETLs with drag-and-drop functionality, similar to Talend.
You can create a Data Fusion instance to build the streaming ETL pipeline. The source will be Pub/Sub, and the sink will be BigQuery or Bigtable depending on your use case.
For reference:
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
This link will guide you through deploying the Google-provided Dataflow template from Pub/Sub to BigQuery.
For your own custom pipeline, you can take help from the GitHub repository of the pipeline code.
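As an illustration of what such a custom pipeline can look like, here is a minimal Beam sketch in Python; the subscription, table, schema, and bucket are placeholders, not names from your project:

```python
# Minimal streaming sketch: Pub/Sub -> BigQuery with the Beam Python SDK.
# Subscription, table, schema and bucket below are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    project="my-project",
    region="us-central1",
    runner="DataflowRunner",        # use "DirectRunner" to test locally
    temp_location="gs://my-bucket/tmp",
    streaming=True,
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadTelemetry" >> beam.io.ReadFromPubSub(
              subscription="projects/my-project/subscriptions/telemetry-sub")
        | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
              "my-project:iot.telemetry",
              schema="device_id:STRING,temperature:FLOAT,ts:TIMESTAMP",
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
              create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```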
In reference to my previous question, I got my boss to go ahead and let me set up a DMS from my existing postgres to our new redshift db for our analytics team.
The next issue I am having: after spending three days searching on this, I have found nothing to help me. My boss wants to use Kinesis to pull real-time data from our PG db into our RS db so our analytics team can work with it in real time. I'm trying to get this configured and I'm running into nothing but headaches.
I have a stream set up, and Firehose set up to grab from the S3 bucket I created called "postgres-stream-bucket", but I'm not sure how to get data from PG to dump into it, and then how to make sure that RS picks everything up and uses it, in real time.
However, if there are better options I would love to hear them, but it is imperative that we have real-time (or as close as possible) translated data.
Amazon Kinesis Firehose is ideal if you have streaming data coming into your systems. It will collect the records, batch them and load them into Redshift. However, it is not an ideal solution for what you have described, where your source is a database rather than random streams of data.
Since you already have the Database Migration Service set up, you can continue to use it for continuous data replication between PostgreSQL and Redshift. This would be the simplest and most effective solution.
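If you prefer to script it rather than click through the console, the ongoing replication task can be created through the DMS API. A rough boto3 sketch, with all ARNs and the schema selection as placeholders:

```python
# Rough sketch: create a DMS task that does a full load followed by ongoing CDC
# replication from PostgreSQL to Redshift. All ARNs below are placeholders.
import json

import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-public-schema",
        "object-locator": {"schema-name": "public", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="pg-to-redshift-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:PG_SOURCE",
    TargetEndpointArn="arn:aws:dms:...:endpoint:RS_TARGET",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",      # initial copy, then continuous changes
    TableMappings=json.dumps(table_mappings),
)
```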
All the examples I've seen are with Java programs?
I want to be able to track a user's behaviour while navigating my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also, for example, want to check all the keywords passed to my search API to get a list of the most searched terms.
I thought about using Oozie, but does anyone have any other suggestions?
There are several options for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (e.g. daily or weekly), you could launch an EMR cluster to perform the analysis. Please note that this is a powerful but rather complex toolset, and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
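As an illustration of the transient-cluster idea, a scheduled job (cron, Lambda, etc.) could launch the cluster through the EMR API and let it shut itself down when its steps finish. A rough boto3 sketch; the names, instance types, and S3 paths are placeholders:

```python
# Sketch: launch a transient EMR cluster that runs one Spark step and terminates
# itself. Names, paths and instance types below are placeholders.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="weekly-api-usage-analysis",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate once the steps finish
    },
    Steps=[{
        "Name": "aggregate-search-terms",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/search_terms.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```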
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
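For example, if your API logs already land in S3, the "most searched terms" question from above can be answered with a small script that submits an Athena query. A sketch assuming a hypothetical api_logs table with keyword and endpoint columns, and a placeholder results bucket:

```python
# Sketch: run an Athena query over API logs in S3 and print the result rows.
# Database, table and bucket names are placeholders.
import time

import boto3

athena = boto3.client("athena")

query = """
    SELECT keyword, COUNT(*) AS searches
    FROM api_logs
    WHERE endpoint = '/search'
    GROUP BY keyword
    ORDER BY searches DESC
    LIMIT 20
"""

qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the rows (the first row is the header).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"][1:]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```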
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.
I am getting started with the AWS IoT service with a Raspberry Pi as the device, and I do not understand how I can guarantee delivery of my data to the AWS IoT MQTT service.
There are two cases:
The device has no Internet connection but is powered on. In this case, I can use an in-memory store (the offline queue from the AWS SDK library).
The device is powered off. In this case, I lose the data held in RAM.
How can I save my data without running a database engine on the Raspberry Pi?
Do you have some best practices?
You will need to somehow save your data to disk to mitigate issue #2. The best practice is to use an established database system. SQLite is a very lightweight database, and not that hard to use; give it a shot! If you really hate that idea, you could just save the data in JSON format to a text file. That works as well.
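A minimal sketch of the SQLite idea: buffer every reading on disk, and drain the buffer whenever the connection to AWS IoT is back up. The publish() argument stands in for whatever MQTT publish call you use (e.g. from the AWS IoT device SDK), and the file path and topic are placeholders:

```python
# Sketch: a tiny on-disk buffer so readings survive a power cycle.
# publish() stands in for whatever AWS IoT MQTT publish call you use.
import json
import sqlite3

db = sqlite3.connect("/home/pi/telemetry_buffer.db")
db.execute("CREATE TABLE IF NOT EXISTS buffer (id INTEGER PRIMARY KEY, payload TEXT)")

def store_reading(reading: dict) -> None:
    # Called for every sensor reading; the commit forces it onto the SD card.
    db.execute("INSERT INTO buffer (payload) VALUES (?)", (json.dumps(reading),))
    db.commit()

def flush(publish) -> None:
    # Call this when the connection to AWS IoT is up. Each row is deleted only
    # after publish() returns, so nothing is lost if publishing fails.
    for row_id, payload in db.execute("SELECT id, payload FROM buffer ORDER BY id").fetchall():
        publish("sensors/telemetry", payload)    # e.g. an MQTT publish with QoS 1
        db.execute("DELETE FROM buffer WHERE id = ?", (row_id,))
        db.commit()
```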
I am trying to build a data warehouse using Redshift in AWS. I want to preprocess Salesforce data by moving it to RDS or S3 (using them as a staging area) before finally moving it into Redshift. I am trying to find out the different ways I can replicate Salesforce data into S3/RDS for this purpose. I have seen many third-party tools which are able to do this, but I am looking for something which can be built in-house. I would like to use this data for dimensional modeling.
Thanks for your help!