I am trying to write some IoT data to an S3 bucket, and I know of 2 options so far.
1) Use the AWS CLI and put the data directly into S3.
The downside of this approach is that I would have to parse out the data and figure out how to write it to S3, so there would be some dev work required here. The upside is that there isn't an additional cost associated with this.
2) Use Kinesis Firehose
The downside of this approach is that it costs more money. It might be wasteful because the data doesn't have to be transferred in real time, and it's not a huge amount of data. The upside is that I don't have to write any code for this data to be written to the S3 bucket.
Is there another alternative that I can explore?
If you're looking to keep costs low, could you use some sort of cron functionality on your IoT device to POST data to a Lambda function that writes to S3?
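As a rough sketch of that idea (the bucket name, environment variable, and payload fields are assumptions, and the device is assumed to POST JSON through API Gateway or a Lambda function URL):

```python
import json
import os
import time

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("BUCKET_NAME", "my-iot-data")  # hypothetical bucket name


def lambda_handler(event, context):
    # API Gateway / function URLs deliver the POST body as a string
    payload = json.loads(event["body"])

    # Key the object by device and timestamp so concurrent writes never collide
    key = f"raw/{payload.get('device_id', 'unknown')}/{int(time.time() * 1000)}.json"

    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    return {"statusCode": 200, "body": json.dumps({"written": key})}
```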
Option 2 with Kinesis Data Firehose has the least administrative overhead.
You may also want to look into the native IoT services. It may be possible to use IoT Core and put the data directly in S3.
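If that route works for your data, a minimal sketch of an IoT Core topic rule with an S3 action via boto3 (the rule name, topic filter, bucket, and IAM role below are assumptions):

```python
import boto3

iot = boto3.client("iot")

# Hypothetical names: adjust the topic filter, bucket, and IAM role to your setup
iot.create_topic_rule(
    ruleName="iot_to_s3",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/telemetry'",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-s3-write",
                    "bucketName": "my-iot-data",
                    # Substitution templates key each message by device and timestamp
                    "key": "raw/${topic(2)}/${timestamp()}.json",
                }
            }
        ],
        "ruleDisabled": False,
    },
)
```

Note that this writes one S3 object per message, which can add up quickly for chatty devices.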
I have multiple data sources from which I need to build and implement a DWH in AWS. I have one challenge with respect to one of my unstructured data sources (data coming from different APIs). How can I ingest data from this source into Amazon Redshift? Can we first pull it into an Amazon S3 bucket and then integrate S3 with Amazon Redshift? Which is the better approach?
Yes, S3 first. Your APIs can write to S3, and/or if you like you can use a service like Kinesis (with or without Firehose) to populate S3. From there it is just work in Redshift.
Without knowing more about the sources, yes S3 is likely the right approach - whether you require latency in seconds, minutes or hours will be an important consideration.
If latency is not a driving concern, simply:
1) Set up an S3 bucket to use as a destination for your initial source(s).
2) Create tables in your Redshift database (loading data from S3 to Redshift requires a pre-existing destination table).
3) Use the COPY command to load from S3 to Redshift.
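As a minimal sketch of the COPY step using the Redshift Data API (the cluster, database, schema, table, bucket path, and IAM role here are assumptions):

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical identifiers: replace with your own cluster, database, table, bucket, and role
copy_sql = """
    COPY analytics.api_events
    FROM 's3://my-dwh-landing/api-events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS JSON 'auto';
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-dwh-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(response["Id"])  # statement id; poll describe_statement() to check completion
```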
As noted, there may be value in Kinesis, especially if you're working with real-time data streams (the service recently introduced support for skipping S3 and streaming directly to Redshift).
S3 is probably the easier approach, if you're not trying to analyze real-time streams.
I am working on an application that will read and analyze the logs of payment transactions. I know I will use Kinesis Analytics as per my requirements, which can take its input from either Data Streams or Firehose. But I am having trouble deciding which input method I should use for my system. My requirements are:
It can tolerate latency, but no data should be lost.
Must record all the errors in DynamoDB or S3 buckets.
Which input stream is suitable for my use case?
Data Streams vs Firehose
Streams:
Kinesis Data Streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs.
Going to write custom code
Real time (200ms latency for classic, 70ms latency for enhanced fan-out)
You must manage scaling (shard splitting/merging)
Data storage for 1 to 7 days, replay capability, multi consumers
Use with Lambda to insert data in real time into Elasticsearch
Firehose:
Firehose handles loading data streams directly into AWS products for processing.
Fully managed; sends to S3, Splunk, Redshift, Elasticsearch
Serverless data transformations with Lambda (see the sketch after this list)
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
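A minimal sketch of such a transformation Lambda, assuming newline-delimited JSON records (the field names are illustrative); Firehose expects each record back with its recordId, a result, and base64-encoded data:

```python
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose hands each record's payload to the Lambda base64-encoded
        payload = json.loads(base64.b64decode(record["data"]))

        # Illustrative transformation: keep only the fields we care about
        transformed = {
            "transaction_id": payload.get("transaction_id"),
            "amount": payload.get("amount"),
            "status": payload.get("status"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(transformed) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```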
Kinesis Data Streams allows consumers to READ streaming data, and it gives you plenty of options to do so. It is best suited for use cases that require custom processing, a choice of stream processing frameworks, and sub-second processing latency.
Data is reliably stored in streams up to 7 days and distributed across 3 Availability Zones.
Kinesis Firehose is used to LOAD streaming data into a target destination (S3, Elasticsearch, Splunk, etc). You can also transform streaming data (by using Lambda) before loading it into the destination.
Data from failed attempts will be saved to S3.
So, if your goal is only to load data into the Kinesis Data Analytics service with minimal or no pre-processing, then try Kinesis Firehose first.
Please note that you would also need to consider aspects such as cost, development effort, scaling options, and data volume when choosing the proper service.
Please take a look at the following AWS Solutions Implementation for reference:
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
https://aws.amazon.com/solutions/implementations/real-time-iot-device-monitoring-with-kinesis/
There are some key differences between Kinesis Stream (KS) and Firehose (FH):
KS is real time, while FH is near-real time.
KS requires manual scaling and setup of its provisioning (shards), while FH is basically serverless.
KS records are immutable (they persist in the stream for its retention period - default 24h), while records in FH are gone the moment they are delivered to the destination.
From what you wrote, I think FH should be considered first: you are not concerned about the near-real-time nature of FH, it is much easier to manage and set up, and you can specify S3 as a backup for failed or all messages:
Kinesis Data Firehose uses Amazon S3 to backup all or failed only data that it attempts to deliver to your chosen destination.
The S3 backup ensures you are not losing records if delivery or Lambda processing fails. Consequently, in my view, Firehose addresses your two points well.
You can use Firehose to feed into Kinesis Analytics, but the question is how Firehose gets its data. You can write your own code to feed data, or use Kinesis Data Streams. Firehose is mainly a delivery system for streaming data that can be written to various destinations such as S3, Redshift or others, with an optional capability to perform data transformation.
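If you do write your own producer code, a minimal sketch of feeding records into an existing delivery stream with boto3 (the stream name and record fields are assumptions):

```python
import json

import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream and payload; Firehose does not add delimiters, so append the newline yourself
record = {"transaction_id": "txn-0001", "amount": 42.50, "status": "ERROR"}

firehose.put_record(
    DeliveryStreamName="payment-logs-to-s3",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```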
Check this link https://www.slideshare.net/AmazonWebServices/abd217from-batch-to-streaming?from_action=save and see how your use case can benefit from the information.
More info: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
If you are creating S3 files from the Kinesis stream but you don't require cleaning of those S3 files, then go with the Firehose option. Also, if you don't have a partitioning-key requirement that produces many small S3 files, then Firehose is a good solution. If you end up doing more clean-up of the FH-created files than it would have taken to create those S3 files yourself, then FH isn't a good option.
It also depends on what you do with those S3 files. You need to find out whether you are saving any work/money by using Firehose versus creating the S3 files manually. Remember you can't reorder the content of the S3 files.
I am new to data engineering, and am picking up a project for personal growth.
I understand Kafka is used for data ingestion, but does it make sense to use it to ingest data from an API to AWS S3?
Is there another way to do the same thing, or in what situations would it make sense to use Kafka here?
The best way to ingest streaming data to S3 is Kinesis Data Firehose. Kinesis is a technology similar to Kafka, and Data Firehose is its specialised version for delivering data to S3 (or other destinations). It's fairly cheap and configurable, and much, much less hassle than Kafka if all you want to do is get data onto S3. Kafka is great, but it's not going to deliver data to S3 out of the box.
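For reference, a rough sketch of creating such a delivery stream with boto3 (the stream name, bucket ARN, IAM role, and buffering values are assumptions):

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical names: replace the bucket ARN and IAM role with your own
firehose.create_delivery_stream(
    DeliveryStreamName="api-events-to-s3",
    DeliveryStreamType="DirectPut",  # your code calls PutRecord; no Kinesis stream in front
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-write",
        "BucketARN": "arn:aws:s3:::my-api-landing-bucket",
        "Prefix": "raw/",
        # Firehose flushes whichever buffering limit is hit first
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```

Your code that polls the API then just calls PutRecord (or PutRecordBatch) against this stream, and Firehose handles batching the records into S3 objects.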
I have a process that publishes data to AWS IoT Core, and that triggers a Lambda function that inserts the payload into an Amazon S3 bucket.
The process sends around 1.2 million records within a few seconds, and when I check the bucket I see I have lost around 10% of the data. If I add a sleep in the Lambda function, its runtime goes beyond the 15-minute limit.
What is the solution for this scenario?
It appears that your requirement is to capture the events coming into IoT-Core and save them to Amazon S3.
It also sounds like your Lambda functions are being throttled due to hitting concurrency limits, and data is being lost. By default, there is a limit of 1,000 concurrent AWS Lambda executions per Region. This could potentially be fixed by requesting an increase in the concurrency quota.
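As a quick way to see what you are working with, the current account-level limits can be read via boto3 (a sketch; no custom names are assumed):

```python
import boto3

lambda_client = boto3.client("lambda")

# Account-level concurrency limit and how much of it is unreserved
settings = lambda_client.get_account_settings()
print(settings["AccountLimit"]["ConcurrentExecutions"])
print(settings["AccountLimit"]["UnreservedConcurrentExecutions"])
```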
Here is a diagram from How AWS IoT works:
As shown in the diagram, the Rules engine can actually be used to send data to Amazon S3 without requiring Lambda. However, this creates a separate object in Amazon S3 for every message.
If you wish to combine messages together, you can Write to Kinesis Data Firehose Using AWS IoT. Firehose will buffer the data by time or size, and then output multiple messages into a single Amazon S3 object. This can be a good way to handle large volumes of data, and it also makes the resulting objects in S3 easier to work with because fewer objects are created, which makes them faster to query and process later (eg with Amazon Athena).
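As a rough sketch of that pattern (the rule, topic filter, delivery stream, and role names are assumptions), an IoT rule can forward every matching message to a Firehose delivery stream, which then buffers and batches the writes to S3:

```python
import boto3

iot = boto3.client("iot")

# Hypothetical names: adjust the topic filter, delivery stream, and IAM role to your setup
iot.create_topic_rule(
    ruleName="iot_to_firehose",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/telemetry'",
        "actions": [
            {
                "firehose": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-write",
                    "deliveryStreamName": "iot-telemetry-to-s3",
                    "separator": "\n",  # newline-delimit records inside each buffered S3 object
                }
            }
        ],
        "ruleDisabled": False,
    },
)
```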
Going from an IoT Core rule directly to a Lambda can be fragile.
You can use Kinesis to buffer the data, or Firehose to stream it directly to S3. These are standard patterns that AWS recommends for IoT in the AWS Well-Architected Framework (https://d1.awsstatic.com/whitepapers/architecture/AWS-IoT-Lens.pdf).
I'm trying to design a big IoT solution of millions of devices starting from zero. That's why I need a highly scalable platform like AWS.
My devices are going to report data using AWS IoT, and that's the only thing I've really decided. I need to store a lot of data, like a temperature measurement every 15 minutes from each device. For those measurements I've planned to insert them directly into DynamoDB using IoT Rules, but on the other side, I need a relational structure to store companies, temperature sensors, etc. So I thought I could store that in MySQL on RDS.
After that, I need to configure a proper analysis tool, so I was thinking of Kinesis, and loading the data into Redshift after ETL using Data Pipeline, since AWS Glue doesn't support DynamoDB.
I'm new to some of the services, so I don't know exactly what I'm doing and I don't know if this approach is the best one. What do you think?
Thanks.
I would have your applications write the edge data (Raw Data) to an S3 bucket with this flow:
Edge (with credentials) -> API Gateway -> Lambda -> S3
Save your raw data as .json files in S3. Then you can use tools like Athena and QuickSight to visualize it (a query sketch follows the list below).
The benefits to this are:
1) Your edge devices don't have to have the AWS SDK
2) S3 is cheap and insanely scalable
3) JSON format can be read by any service so you are not locked into AWS for visualization.
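As promised above, a rough sketch of querying those raw JSON files with Athena via boto3; the database, table, bucket, and column names are assumptions, and the table would first need to be defined (e.g. with a CREATE EXTERNAL TABLE statement or a Glue crawler):

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table over the raw JSON prefix, plus a results bucket for query output
response = athena.start_query_execution(
    QueryString="""
        SELECT device_id, AVG(temperature) AS avg_temp
        FROM raw_device_data
        GROUP BY device_id
    """,
    QueryExecutionContext={"Database": "iot_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() until the query completes
```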