IoT Big Data design on AWS

I'm trying to design a big IoT solution with millions of devices, starting from zero. That's why I need a highly scalable platform like AWS.
My devices are going to report data using AWS IoT, and that's the only thing I've really decided. I need to store a lot of data, like a temperature measurement every 15 minutes from each device, so I've planned to insert those measurements directly into DynamoDB using IoT Rules. On the other side, I need a relational structure to store companies, temperature sensors, etc., so I thought I could store that in MySQL on RDS.
After that, I need to configure a proper analysis tool, so I was thinking of Kinesis, and of loading the data into Redshift after ETL with Data Pipeline, since AWS Glue doesn't support DynamoDB.
I'm new to some of these services, so I don't know exactly what I'm doing, and I don't know if this approach is the best one. What do you think?
Thanks.

I would have your applications write the edge data (Raw Data) to an S3 bucket with this flow:
Edge(With credentials) -> APIGateway -> Lambda -> S3
Save your raw data as .json files in S3. Then you can use tools like Athena and QuickSight to visualize it.
The benefits to this are:
1) Your edge devices don't have to have the AWS SDK
2) S3 is cheap and insanely scalable
3) JSON format can be read by any service, so you are not locked into AWS for visualization.
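For reference, here is a minimal sketch of the Lambda behind API Gateway in that flow, assuming a proxy integration and a hypothetical bucket name; it simply lands each device payload as a date-partitioned .json object:

```python
# Hypothetical Lambda for the Edge -> API Gateway -> Lambda -> S3 flow.
# Bucket name and key layout are assumptions; the device payload arrives
# in the API Gateway proxy-integration body.
import datetime
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-iot-raw-data"  # placeholder bucket

def handler(event, context):
    payload = json.loads(event.get("body") or "{}")
    now = datetime.datetime.utcnow()
    # Date-partitioned keys make later querying with Athena cheaper
    key = f"raw/{now:%Y/%m/%d}/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```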

Related

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer-facing app. For this app, data is captured by IoT devices owned by a third party and is transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster. We have the user app connected to this cluster with real-time data feed requirements. Note: the data is time-series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. Then we can directly run queries on BigQuery to perform analysis and send data to maybe Explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently, in terms of memory and pricing? Also, we don't want to disturb the performance of AWS DocumentDB, since it supports our user-facing app.
This solution would need some custom implementation. You can utilize change streams and process the data changes at intervals to send to BigQuery, so there is a data replication mechanism in place for you to run analytics. One of the documented use cases of change streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
That document also contains sample Python code for consuming change stream events.
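Along the same lines, here is a hedged sketch of what a periodic replication task could look like, assuming change streams are enabled on the collection, pymongo for reading them, and the google-cloud-bigquery client for streaming inserts; the connection string, collection and table names are placeholders:

```python
# Hypothetical batch replicator: drain recent change-stream events from a
# DocumentDB collection and append them to a BigQuery table. Endpoint,
# credentials, collection and table id are illustrative placeholders.
import pymongo
from google.cloud import bigquery

docdb = pymongo.MongoClient(
    "mongodb://user:pass@my-docdb-cluster:27017/?tls=true&replicaSet=rs0&retryWrites=false"
)
collection = docdb["telemetry"]["readings"]
bq = bigquery.Client()
TABLE_ID = "my-project.analytics.readings_changes"  # placeholder table

def replicate_batch(resume_token=None, max_events=1000):
    """Drain up to max_events change events and stream them into BigQuery."""
    rows, token = [], resume_token
    with collection.watch(full_document="updateLookup",
                          resume_after=resume_token) as stream:
        for _ in range(max_events):
            change = stream.try_next()  # non-blocking poll
            if change is None:
                break
            token = stream.resume_token
            doc = change.get("fullDocument") or {}
            rows.append({
                "operation": change["operationType"],
                "doc_id": str(change["documentKey"]["_id"]),
                "payload": str(doc),
            })
    if rows:
        errors = bq.insert_rows_json(TABLE_ID, rows)  # streaming insert
        if errors:
            raise RuntimeError(f"BigQuery insert errors: {errors}")
    return token  # persist this (e.g. in S3 or SSM) to resume the next run
```

Running something like this on a schedule (a small Lambda or a cron job) keeps the analytics copy reasonably fresh without putting a continuous read load on the cluster.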

Creating a data lake from a DynamoDB table

We have a service where a ~50 GB DynamoDB table is our feature repository, which we use for real-time, online applications.
We want to create a data lake from this table for historical data, model training and analytics insights. We want to guarantee a 30-minute "freshness" of the data lake with respect to the original table.
However, I'm confused about what a good architecture for this could be: my understanding of data lakes is that you should use a storage service (e.g., S3) to store the raw data with no processing. Then you perform ETL jobs, where you transform, process and filter the data (e.g., using Glue), before using it for whatever app.
But here is my doubt: does this mean that we have to dump the DynamoDB table into S3 every 30 minutes? This can easily be done, but it sounds weird (it would result in ~876 TB/year).
Am I missing something in the data lake pipeline?
You've hit a common problem, and it's one AWS is actively working on.
If you want continuous syncing from DynamoDB to S3, it's possible using existing technology, including DynamoDB Streams. I suggest checking out this project in awslabs. Frankly, it's quite a bit of effort.
However, I believe AWS is about to release a product that will keep DynamoDB tables and S3 buckets in sync, without code, in a few clicks. It's called AWS Glue Elastic Views. The product is in preview. They announced it in December 2020, so I'm hoping it's available soon. There is also a form you can fill in to join the trial, but there is no guarantee AWS will give you access.
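To make the "existing technology" route a bit more concrete, here is a minimal sketch assuming a Lambda subscribed to the table's DynamoDB stream that appends each change as newline-delimited JSON in S3; the bucket name and key layout are my own placeholders, not part of the awslabs project:

```python
# Hypothetical Lambda triggered by a DynamoDB stream: write each batch of
# change records to S3 as newline-delimited JSON for later ETL/Athena use.
import datetime
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-feature-datalake"  # placeholder landing bucket

def handler(event, context):
    now = datetime.datetime.utcnow()
    lines = []
    for record in event.get("Records", []):
        ddb = record["dynamodb"]
        lines.append(json.dumps({
            "event": record["eventName"],      # INSERT / MODIFY / REMOVE
            "keys": ddb.get("Keys"),
            "new_image": ddb.get("NewImage"),  # still in DynamoDB JSON form
            "approx_ts": ddb.get("ApproximateCreationDateTime"),
        }, default=str))
    if lines:
        # One object per invocation, partitioned by hour
        key = f"cdc/{now:%Y/%m/%d/%H}/{context.aws_request_id}.json"
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body="\n".join(lines).encode("utf-8"))
```

This only ships the changes, so you avoid dumping the full ~50 GB table every 30 minutes; a periodic compaction/ETL job can then fold the change files into the lake.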

Alternative to writing data to S3 using Kinesis Firehose

I am trying to write some IoT data to an S3 bucket, and I know of two options so far.
1) Use the AWS CLI and put the data directly into S3.
The downside of this approach is that I would have to parse out the data and figure out how to write it to S3, so there would be some dev work required here. The upside is that there is no additional cost associated with this.
2) Use Kinesis Firehose
The downside of this approach is that it costs more money. It might be wasteful because the data doesn't have to be transferred in real time, and it's not a huge amount of data. The upside is that I don't have to write any code for this data to be written to the S3 bucket.
Is there another alternative that I can explore?
If you're looking to keep costs low, could you use some sort of cron functionality on your IoT device to POST data to a Lambda function that writes to S3?
Option 2 with Kinesis Data Firehose has the least administrative overhead.
You may also want to look into the native IoT services. It may be possible to use IoT Core and put the data directly into S3.
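For example, here is a hedged sketch of that native route, assuming an IoT Core topic rule with an S3 action created via boto3; the rule name, topic filter, bucket and role ARN are placeholders:

```python
# Hypothetical IoT Core topic rule that writes each incoming message straight
# to S3, with no Firehose or custom code in between.
import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="iot_to_s3",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/telemetry'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [{
            "s3": {
                "roleArn": "arn:aws:iam::123456789012:role/iot-s3-write",
                "bucketName": "my-iot-raw-data",
                # One object per message, keyed by device id and timestamp
                "key": "raw/${topic(2)}/${timestamp()}.json",
            }
        }],
    },
)
```

Note this writes one small object per message, which is fine at low volume; at higher volume the Firehose option batches records into larger objects for you.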

What is the right architecture/design for a JavaScript-client to AWS-database website tracking system

We wish to build a data pipeline system which tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it, given the following two constraints:
1) the system must run on Amazon (AWS)
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on the above two constraints, my plan is to implement the following architecture:
website-javascript --> AWS-S3 -->(AWS-Lambda)--> AWS-RDS
website JavaScript client - tracks user interactions and sends them to aws-firehose
aws-firehose data delivery system to S3 - eventually writes the events to aws-S3
AWS Lambda (Python) - a periodic task which pulls daily events from AWS-S3 and loads them into AWS-RDS
The reason I have chosen AWS-RDS is its cost-effectiveness for this objective.
I'd appreciate any comments on the above-mentioned implementation, or any other architecture you may recommend instead.
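As a rough sketch of the Lambda step in this proposal (the bucket, prefix, table name and the pymysql dependency are assumptions for illustration, not part of the question):

```python
# Hypothetical daily-batch Lambda: read the previous day's JSON events that
# Firehose landed in S3 and insert them into an RDS MySQL table.
import datetime
import json

import boto3
import pymysql  # assumed to be packaged as a Lambda layer

s3 = boto3.client("s3")
BUCKET = "my-tracking-bucket"  # placeholder

def handler(event, context):
    day = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    prefix = f"events/{day}/"
    conn = pymysql.connect(host="my-rds-endpoint", user="analytics",
                           password="***", database="tracking")
    try:
        with conn.cursor() as cur:
            paginator = s3.get_paginator("list_objects_v2")
            for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
                for obj in page.get("Contents", []):
                    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                    # Assume each line of the object is one JSON event
                    for line in body.decode("utf-8").splitlines():
                        e = json.loads(line)
                        cur.execute(
                            "INSERT INTO events (user_id, event_type, occurred_at) "
                            "VALUES (%s, %s, %s)",
                            (e.get("user_id"), e.get("event"), e.get("ts")),
                        )
        conn.commit()
    finally:
        conn.close()
```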
If I understand your question correctly, you are proposing the solution below to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with the above design:
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tends to grow rapidly
You need to handle load balancing, failure scenarios and other complexities for Lambda
You need to handle data transformation for RDS, as it expects structured data to be ingested into relational tables
Your proposal to store the data in S3 through Firehose sounds like a good solution. But please keep in mind that the minimum buffering interval for Firehose is one minute, so your application needs to tolerate this minor latency. You may use Kinesis Data Streams to get millisecond latency, but then you need to manage your own application code and instances to handle the streams.
After ingesting data into Kinesis Firehose or Streams, you may also explore the alternatives below:
Use Kinesis Analytics to track web users' activity in real time if it's available in your AWS region. It's currently only available in selected AWS regions.
Within Firehose, transform your data using a Lambda function and store it in S3 in an optimized format for further analysis with AWS Athena (a sketch follows below)
Use Elasticsearch as a destination and perform web analytics with the ELK stack instead of RDS
Though you mentioned that you cannot use Redshift, it may still be the best solution for time-series analysis. Exploring Redshift, Redshift Spectrum and formatted data stored in S3 may still be a cost-effective solution with better capabilities.
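The Firehose transformation Lambda mentioned above could look roughly like this; the field names are assumptions, but the recordId/"Ok"/base64 contract is what Firehose expects from a transformation function:

```python
# Hypothetical Firehose data-transformation Lambda: decode each record, keep
# only the fields needed for analysis, and return newline-delimited JSON so
# the S3 objects are easy to query with Athena.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        slimmed = {
            "user_id": payload.get("user_id"),
            "event": payload.get("event"),
            "page": payload.get("page"),
            "ts": payload.get("ts"),
        }
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" / "ProcessingFailed" are the other options
            "data": base64.b64encode(
                (json.dumps(slimmed) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```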
Adding a few references from AWS, which you may go through before deciding on a solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey folks, this is getting more and more common.
Generally the pattern is click events to Kinesis streams, then you can monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, as well as incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis streams in parallel, so this solution might not be as scalable as using AWS's managed Kafka (MSK). Or perhaps run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already has: real-time-web-analytics-with-kinesis
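To make the first step concrete, here is a minimal sketch of putting a click event on a stream with boto3; the stream name and event fields are placeholders:

```python
# Hypothetical click-event producer: the collector endpoint (or web tier)
# puts each event on a Kinesis Data Stream for downstream analytics/Firehose.
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_click(user_id, page):
    event = {"user_id": user_id, "event": "page_view", "page": page}
    kinesis.put_record(
        StreamName="clickstream",   # placeholder stream name
        Data=(json.dumps(event) + "\n").encode("utf-8"),
        PartitionKey=user_id,       # keeps one user's events ordered on one shard
    )

publish_click("u-123", "/pricing")
```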

What are the differences between Amazon Redshift and the new AWS Glue data warehousing services?

I am confused about these two services. It looks like they are offering the same service. Probably the only difference is that the Glue catalog can contain a wider range of data sources. Does this mean that AWS Glue can replace Redshift?
The comment is right: these two services are not the same. AWS Glue is an ETL service, while Amazon Redshift is a data warehousing service.
According to the AWS documentation:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
According to the AWS documentation:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
You can refer to the documentation provided by AWS for details, but essentially these are totally different services.
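As an illustration of how the two actually fit together (rather than compete), here is a hedged skeleton of a Glue ETL job that reads a catalogued dataset and loads it into Redshift; the database, table and connection names are placeholders:

```python
# Hypothetical AWS Glue (PySpark) job: read a table registered in the Glue
# Data Catalog and load it into a Redshift cluster through a Glue connection.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: table previously crawled into the Data Catalog (placeholder names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="iot_raw", table_name="temperature_readings")

# Transform: drop anything not needed in the warehouse
trimmed = source.drop_fields(["debug_payload"])

# Load: write into Redshift via a pre-created Glue connection (placeholder)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=trimmed,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "public.temperature_readings",
                        "database": "analytics"},
    redshift_tmp_dir="s3://my-glue-temp-bucket/tmp/",
)

job.commit()
```

So Glue prepares and moves the data, while Redshift is where you store and query it at warehouse scale.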