AWS: Stream data from IoT to dashboard graphs

We need to get data from thousands of IoT devices (temperature, pressure, RPM, etc., 50+ parameters in total) and show it on a dashboard in real time, without much processing (just checking whether the numbers are in range and raising an alarm otherwise).
I have reviewed and tested many AWS blog resources, such as the Kinesis Storm ClickStream App; however, I think using Storm is overkill for such an easy task. All I want to do is save the data in a DB and show graphs (30 minutes, 1 hour, or a custom date range). This is what I have figured out so far:
Device -> AWS IOT(mqtt) -> Kinesis -> x -> dynamoDB -> Presenter Web APP (Laravel)
I might have to use Node.js and Redis Pub/Sub as mentioned in ClickStream example for real time updates to graphs and alerts.
I don't want to use Apache Storm because it's in Java and has a learning curve (and I couldn't find any good resources). I know I can use Lambda, but I'm not sure how it will scale.
Any thoughts on a solution?
AWS doesn't have a KCL for PHP. Are there alternatives or solutions? I am familiar with PHP but not with Java.

Apache Storm is a distributed event processing framework. In your use case, you do not seem to perform any computation on the events. Basically, your application is doing three tasks:
Ingest data into the system.
Read the data from period X to Y.
Draw graphs on a web frontend.
The ingestion part is taken care of by AWS IoT. As a first step, create SNS topics and publish all IoT data to them. This gives you the flexibility to create one topic per data type (e.g. temperature, pressure) and to attach consumer SQS queues to the topics to accumulate the messages. For persistent storage, one consumer can write to a DynamoDB table; another consumer can be a Lambda function that filters and transforms the data and updates your cache. If you need to run OLAP/analytical queries on the data, consider using Redshift as one of the consumers. You will have to get into specific requirements to finalize your design.
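As a rough illustration of the filtering consumer, here is a minimal Python sketch of the range check the question describes. The parameter names, the limits and the message shape (`device_id`, `readings`) are assumptions made up for the example, not something AWS IoT prescribes.

```python
from typing import Dict, List, Tuple

# Hypothetical limits; real ranges would come from the device specifications.
LIMITS: Dict[str, Tuple[float, float]] = {
    "temperature": (-20.0, 85.0),
    "pressure": (0.5, 10.0),
    "rpm": (0.0, 6000.0),
}


def out_of_range(reading: Dict[str, float], limits=LIMITS) -> List[str]:
    """Return the sorted names of parameters whose value is outside its range."""
    alarms = []
    for name, value in reading.items():
        bounds = limits.get(name)
        if bounds is None:
            continue  # no limit configured for this parameter
        low, high = bounds
        if not (low <= value <= high):
            alarms.append(name)
    return sorted(alarms)


def handler(event, context=None):
    """Lambda-style entry point; 'event' is assumed to be one decoded device message."""
    alarms = out_of_range(event.get("readings", {}))
    return {"device_id": event.get("device_id"), "alarms": alarms}
```

A non-empty `alarms` list is the point where the function would publish to an alert topic or update the cache; that part is left out because it depends on your setup.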

Have you considered routing your data to AWS IoT Analytics after receiving the MQTT message in IoT Core? This way you could get rid of all the infrastructure heavy lifting with Kinesis, DynamoDB and your presentation layer.
AWS IoT Analytics provides the ingestion, data preparation and querying capabilities. Once the data is stored in the processed datastore, you can visualize it with AWS QuickSight.

Related

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer facing App. For this app, data is being captured by IoT devices owned by a 3rd party, and is transferred to us from their server via API calls. We store this data in our AWS Documentdb cluster. We have the user App connected to this cluster with real time data feed requirements. Note: The data is time series data.
The thing is, for long term data storage and for creating analytic dashboards to be shared with stakeholders, our data governance folks are requesting us to replicate/copy the data daily from the AWS Documentdb cluster to their Google cloud platform -> Big Query. And then we can directly run queries on BigQuery to perform analysis and send data to maybe explorer or tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently in terms of memory and pricing? Also, we don't want to disturb the performance of AWS Documentdb, since it supports our user-facing App.
This would need some custom implementation. You can use Change Streams and process the data changes at intervals to send them to BigQuery, so that a data replication mechanism is in place for you to run analytics. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
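To make "process the data changes at intervals" a bit more concrete, here is a small Python sketch that groups change events into batches and keeps the resume token of the last event in each batch. It only assumes the standard change event fields (`_id` as the resume token, `operationType`, `documentKey`, `fullDocument`); the row layout produced by `to_row` is hypothetical, and the actual BigQuery load is not shown.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple

Event = Dict[str, Any]


def batch_changes(
    events: Iterable[Event], batch_size: int
) -> Iterator[Tuple[List[Event], Optional[Any]]]:
    """Group change events into batches; yield each batch with its last resume token."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch, batch[-1]["_id"]
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch, batch[-1]["_id"]


def to_row(event: Event) -> Dict[str, Any]:
    """Flatten one change event into a warehouse row (column names are hypothetical)."""
    return {
        "operation": event["operationType"],
        "document_id": str(event["documentKey"]["_id"]),
        "document": event.get("fullDocument"),
    }
```

In practice the events would come from a change stream cursor (for example pymongo's `collection.watch(resume_after=token)`), and the token would be persisted only after a batch has been loaded successfully, so that a failed run can resume without losing changes.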

What is the right architecture/design for a JavaScript-client-to-AWS-database website tracking system

We wish to build a data pipeline system which tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it, given the following two constraints:
1) the system must run on Amazon (AWS)
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on the above two constraints my plan is to implement the following architecture:
website-javascript --> AWS-S3 -->(AWS-Lambda)--> AWS-RDS
Website JavaScript client - tracks user interactions and sends them to AWS Firehose.
AWS Firehose data delivery to S3 - Firehose eventually writes the events to AWS S3.
AWS Lambda (Python) - a periodic task which pulls the daily events from AWS S3 and loads them into AWS RDS.
I have chosen AWS RDS because of its cost-effectiveness for this objective.
I would appreciate any comments on the above implementation, or any other architecture you would recommend instead.
If I understand your question correctly, you are proposing the solution below to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with this design.
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tends to grow rapidly
You need to handle load balancing, failure scenarios and other complexities for Lambda
You need to handle data transformation for RDS, as it expects structured data to be ingested into relational tables
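To give an idea of what that transformation step involves, here is a minimal Python sketch that turns newline-delimited JSON events (as Firehose might write them to S3) into fixed-shape rows and bulk-inserts them. The field names (`user_id`, `event_type`, `ts`, `props`) and the `events` table are assumptions for illustration, and `sqlite3` stands in for RDS so the example is self-contained; a MySQL or PostgreSQL driver would typically use `%s` placeholders instead of `?`.

```python
import json
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str, str]


def parse_events(lines: Iterable[str]) -> List[Row]:
    """Turn newline-delimited JSON events into rows, skipping malformed lines."""
    rows: List[Row] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
            rows.append((
                event["user_id"],
                event["event_type"],
                event["ts"],
                # Free-form properties are kept as a JSON string column.
                json.dumps(event.get("props", {}), sort_keys=True),
            ))
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # in a real job, count or log the rejected line
    return rows


def load_events(conn: sqlite3.Connection, rows: List[Row]) -> int:
    """Insert all rows in a single transaction and return the number of rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(user_id TEXT, event_type TEXT, ts TEXT, props TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

Keeping the parsing separate from the insert makes it easy to test the transformation without a database, and a single `executemany` per file keeps the number of round trips to the database low.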
Storing the data in S3 through Firehose sounds like a good solution. But please keep in mind that Firehose buffers data before delivering it (the minimum buffer interval was one minute at the time of writing), so your application needs to tolerate this minor latency. You may use Kinesis Streams to get sub-second latency, but then you need to manage your own application code and instances to consume the streams.
After ingesting data into Kinesis Firehose or Streams, you may also explore the alternatives below:
Use Kinesis Analytics to track web user activity in real time, if it's available in your AWS region. It's currently only available in selected AWS regions
Within Firehose, transform your data using Lambda and store it in S3 in an optimized format for further analysis with AWS Athena
Use Elasticsearch as a destination and perform web analytics with the ELK stack instead of RDS
Though you mentioned that you cannot use Redshift, it may still be the best solution for time series analysis. Exploring Redshift, Redshift Spectrum and formatted data stored in S3 may still be a cost-effective solution with better capabilities
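For the "transform within Firehose" option, a transformation Lambda receives a batch of base64-encoded records and has to return each `recordId` with a `result` of `Ok`, `Dropped` or `ProcessingFailed`. Below is a minimal Python sketch of such a handler; the event fields it keeps (`user_id`, `event_type`, `ts`) are assumptions, and converting to a columnar format for Athena is not shown.

```python
import base64
import json


def handler(event, context=None):
    """Firehose transformation handler: keep a few fields, emit newline-delimited JSON."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical projection: keep only the columns needed for analysis.
            slim = {
                "user_id": payload["user_id"],
                "event_type": payload["event_type"],
                "ts": payload["ts"],
            }
            data = base64.b64encode((json.dumps(slim) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": data.decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Hand the original payload back so Firehose can route it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

The trailing newline after each record keeps the objects in S3 line-delimited, which is the layout Athena's JSON handling generally expects.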
Here are a few references from AWS that you may want to go through before deciding on a solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey, this is getting more and more common.
Generally the pattern is to send click events to Kinesis Streams; you can then monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, and incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis streams in parallel, so this solution might not be as scalable as using Kafka on AWS. Alternatively, you could run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already provides: real-time-web-analytics-with-kinesis

How to handle AWS IOT streaming data in relational database

General information: I am designing a solution for an IoT problem in which data streams continuously from a PLC (programmable logic controller). The PLC has different tags; these tags represent telemetry data, and data streams continuously from them. Each device also has alarm tags whose value is 0 or 1, where 1 means there is an equipment failure.
Problem statement: I have to read the alarm tags and raise a ticket if any alarm tag's value is 1. I also have to stream these alerts to a dashboard and maintain the ticket history, so that the operator can update the ticket status.
My solution: I am using AWS IoT and storing the data in DynamoDB. I then use a DynamoDB stream to detect new items added to the alarm table; each new item triggers a Lambda function (implemented in Java) that opens a new ticket in a relational database using Hibernate.
Problem with my approach: the AWS IoT data streams into the alarm table at a very fast rate, which opens a lot of database connections before earlier ones can be closed, and that is taking my relational database down.
Please let me know whether there is a better design approach I can adopt.
Use Amazon Kinesis Analytics to process the streaming data. DynamoDB isn't suitable for this.
Read more here
Below image will give you an idea for same
Just a proposal:
From Lambda, do not contact RDS directly.
Rather, push all alarms to AWS SQS.
Then you can have another Lambda, scheduled to run every minute using AWS CloudWatch Rules, that picks up all items from AWS SQS and inserts them into RDS at once.
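A rough Python sketch of that scheduled Lambda could look like the following. It assumes a boto3-style SQS client (`receive_message` and `delete_message_batch` are the real client calls), a DB-API connection to RDS, and a hypothetical `tickets` table and alarm message shape (`device_id`, `tag`, `ts`). The `?` placeholders are sqlite-style; MySQL and PostgreSQL drivers usually expect `%s`.

```python
import json


def drain_alarms(sqs, queue_url, conn, max_batches=50):
    """Pull queued alarms, insert them in one transaction, then delete them from SQS."""
    rows, receipts = [], []
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=0
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue is empty for now
        for msg in messages:
            alarm = json.loads(msg["Body"])
            rows.append((alarm["device_id"], alarm["tag"], alarm["ts"]))
            receipts.append(msg["ReceiptHandle"])
    if not rows:
        return 0

    # One connection and one transaction for the whole batch.
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO tickets (device_id, tag, opened_at) VALUES (?, ?, ?)", rows
    )
    conn.commit()

    # Delete only after the commit; SQS accepts at most 10 entries per call.
    for start in range(0, len(receipts), 10):
        chunk = receipts[start:start + 10]
        entries = [
            {"Id": str(i), "ReceiptHandle": handle} for i, handle in enumerate(chunk)
        ]
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
    return len(rows)
```

Deleting the messages only after the commit means a failed insert leaves them in the queue for the next run, at the cost of possible duplicates, so the queue's visibility timeout should be longer than one run and the insert should tolerate retries.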
I agree with raevilman's design of not letting Lambda contact RDS directly.
Creating a new ticket is not the only task your Lambda function is doing; you are also streaming these alerts to a dashboard. Depending on the streaming rate and the RDS limitations, you may want to split these tasks across multiple queues.
Generic solution: I'd suggest pushing the alarm to a fanout exchange, which in turn pushes the alarm to one or more queues as required. You can then batch the alarms and perform multiple writes together, without going through a connect/disconnect cycle each time.
AWS-specific solution: I haven't used SQS, so I can't really comment on its architecture. Alternatively, you can create an SNS topic and publish these alarms to it. You can then have SQS queues as subscribers to this topic, which are used for ticketing and for the dashboard independently of each other.
Here again, you can poll messages from the ticketing queue in batches, using Lambda or your own scheduler, and process the tickets (with a frequency that depends on how time-critical the alarms are).
You may want to read this tutorial to get some pointers.
You can control the concurrency of the Lambda function. This will reduce the number of Lambdas that get spun up by the DynamoDB events, thereby reducing the number of connections to RDS.
https://aws.amazon.com/blogs/compute/managing-aws-lambda-function-concurrency/
Of course, this will throttle the DynamoDB events.

AWS SQS and other services

My company has a messaging system which sends real-time messages in JSON format, and it's not built on AWS.
Our team is trying to use AWS SQS to receive these messages, and then DynamoDB to store them.
I'm thinking of using EC2 to read these messages and then save them.
Is there a better solution? Or how should I do it? I don't have much experience with this.
First of all, EC2 is infrastructure in the cloud; it is similar to a physical machine with an OS in a local setup. If you want to create an application that fetches data from Amazon SQS (messages in JSON format) and pushes it into DynamoDB (a NoSQL database), your design is correct, as both SQS and DynamoDB have thorough JSON support. Once your application is ready, you deploy it on an EC2 machine.
To achieve this, your application needs an async buffered SQS consumer that consumes the messages. SQS limits the message size (256 KB at the time of writing), so whichever application publishes the messages must keep each message below that limit.
Please refer to the link below for an SQS consumer:
is putting sqs-consumer to detect receiveMessage event in sqs scalable
Once you have consumed a message from the SQS queue, you need to save it in DynamoDB, which you can easily do using a CRUD repository. With a repository you can save the JSON directly in a DynamoDB table, but please be sure to configure the provisioned write capacity based on your request rate, because the higher the provisioned capacity, the higher the cost. Please refer to the link below for configuring the write capacity of a table.
Dynamodb reading and writing units
In general, you'll have a setup something like this:
The EC2 instances (one or more) will read your queue every few seconds to see if there is anything there. If so, they will write this data to DynamoDB.
Based on what you're saying, you'll have fewer than 1,000,000 reads from SQS in a month, so you can start out on the free tier for that. You can have a single EC2 instance initially, and it can be a very small instance; a t2.micro should be more than sufficient. And you don't need more than a few writes per second on DynamoDB.
The advantage of SQS is that if for some reason your EC2 instance is temporarily unavailable the messages continue to queue up and you won't lose any of them.
From a coding perspective, you don't mention your development environment but there are AWS libraries available for a pretty wide variety of environments. I develop in Java and the code to do this would be maybe 100 lines. I would guess that other languages would be similar. Make sure you look at long polling in the language you're using - it can help to speed up the processing and save you money.
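For reference, here is roughly what one long-polling iteration could look like in Python. It is written against a boto3-style SQS client and DynamoDB `Table` resource (`receive_message`, `delete_message` and `put_item` are the real method names); the `message_id` key attribute is an assumption, so adapt it to your table's key schema.

```python
import json
from decimal import Decimal


def poll_once(sqs, queue_url, table, wait_seconds=20):
    """Long-poll the queue once, store each JSON message in DynamoDB, then delete it."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,  # long polling; SQS allows up to 20 seconds
    )
    stored = 0
    for msg in resp.get("Messages", []):
        # The boto3 DynamoDB resource rejects Python floats, so parse numbers as Decimal.
        item = json.loads(msg["Body"], parse_float=Decimal)
        # Hypothetical key attribute: fall back to the SQS message id.
        item.setdefault("message_id", msg["MessageId"])
        table.put_item(Item=item)
        # Delete only after the write succeeded, so failures are retried later.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        stored += 1
    return stored
```

On the EC2 instance this would simply run in a `while True:` loop. With long polling the call blocks until messages arrive or the wait time elapses, which is what keeps the number of empty (but billed) requests low.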

AWS : What's the difference between Simple Workflow Service and Data Pipeline?

What's the difference between Amazon Simple Workflow Service and Amazon Data Pipeline ? It seems that they are pretty much the same product. The Data Pipeline has a nice web based diagram editor though.
Cheers !
From http://aws.amazon.com/datapipeline/faqs/
Q: How is AWS Data Pipeline different from Amazon Simple Workflow
Service?
While both services provide execution tracking, retry and
exception-handling capabilities, and the ability to run arbitrary
actions, AWS Data Pipeline is specifically designed to facilitate the
specific steps that are common across a majority of data-driven
workflows – in particular, executing activities after their input data
meets specific readiness criteria, easily copying data between
different data stores, and scheduling chained transforms. This highly
specific focus means that its workflow definitions can be created
[with] very rapidly and with no code or programming knowledge.
Data Pipeline is a service used to transfer data between various AWS services. For example, you can use Data Pipeline to read the log files from your EC2 instances and periodically move them to S3.
Simple Workflow Service is a very powerful service. You can even write your workflow logic using it. For example, most e-commerce systems have scalability problems in their order systems; you can write code in SWF to implement the ordering workflow itself.
AWS Big Data Blog does a wonderful job of explaining key features of SWF, Data Pipeline & Lambda.
Below diagram is copied from the blog.