I am new to Data Engineering and am picking up a project for personal growth.
I understand Kafka is used for data ingestion, but does it make sense to use it to ingest data from an API to AWS S3?
Is there another way to do this, or in what situations would it make sense to use Kafka for this?
The best way to ingest streaming data to S3 is Kinesis Data Firehose. Kinesis is a technology similar to Kafka, and Data Firehose is its specialised version for delivering data to S3 (or other destinations). It's fairly cheap and configurable, and much, much less hassle than Kafka if all you want to do is get data onto S3. Kafka is great, but it's not going to deliver data to S3 out of the box.
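To make that concrete, here is a minimal sketch of pushing API responses into a Firehose delivery stream that has been configured with an S3 destination. The stream name, API URL, and polling loop are hypothetical; Firehose handles the buffering and delivery to S3.

```python
import json
import time

import boto3
import requests

firehose = boto3.client("firehose")

DELIVERY_STREAM = "api-to-s3"                   # hypothetical delivery stream with an S3 destination
API_URL = "https://api.example.com/v1/events"   # hypothetical source API

while True:
    payload = requests.get(API_URL, timeout=10).json()
    # Firehose expects bytes; newline-delimited JSON keeps the resulting S3 objects easy to query later.
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
    )
    time.sleep(60)  # poll the API once a minute
```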
Related
I have different data sources and I need to publish them to S3 in real time. I also need to process and validate the data before delivering it to the S3 buckets. I know that AWS Kinesis Data Streams offers real-time data streaming, and I can process the data using AWS Lambda before sending it to S3. However, it is not clear to me whether we can use AWS Glue Streaming instead of AWS Kinesis Data Streams and AWS Lambda. I have seen some documentation about using AWS Glue Streaming to process real-time data on the fly and send it to S3. So, what are the real differences here? Is AWS Glue Streaming ETL a good choice for streaming and processing data in real time and storing it in S3?
A Kinesis data stream with a Lambda consumer will fit as long as the Lambda execution environment limits are sufficient:
15 mins execution time
Memory config
Concurrency limits
When going with a Glue consumer, your Glue jobs can run longer and also support Apache Spark for massively parallel processing.
You can also use Kinesis Firehose, which has native integrations to deliver data to S3, Elasticsearch, etc., and doesn't require any changes to the data. You can also have a Lambda intercept the data and do minimal processing before Firehose delivers it.
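As a rough illustration of that interception step, Firehose can invoke a transformation Lambda that receives base64-encoded records and must return them in the same shape with a result status. This is only a sketch of the record-transformation contract; the validation rule (dropping records without an 'id' field) is hypothetical.

```python
import base64
import json


def lambda_handler(event, context):
    """Minimal Firehose transformation: decode each record, do light validation, re-encode."""
    output = []
    for record in event["records"]:
        data = json.loads(base64.b64decode(record["data"]))

        # Hypothetical minimal processing: drop records missing an 'id' field.
        if "id" not in data:
            result = "Dropped"
            payload = record["data"]  # return the original data unchanged for dropped records
        else:
            result = "Ok"
            payload = base64.b64encode((json.dumps(data) + "\n").encode("utf-8")).decode("utf-8")

        output.append({"recordId": record["recordId"], "result": result, "data": payload})

    return {"records": output}
```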
I am trying to write some IoT data to an S3 bucket, and I know of two options so far.
1) Use the AWS CLI and put the data directly into S3.
The downside of this approach is that I would have to parse out the data and figure out how to write it to S3, so there would be some dev work required here. The upside is that there isn't an additional cost associated with it.
2) Use Kinesis Firehose.
The downside of this approach is that it costs more money. It might be wasteful because the data doesn't have to be transferred in real time, and it isn't a huge amount of data. The upside is that I don't have to write any code for this data to be written to the S3 bucket.
Is there another alternative that I can explore?
If you're looking to keep costs low, could you perhaps use some sort of cron functionality on your IoT device to POST data to a Lambda function that writes to S3?
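A minimal sketch of such a Lambda is below; the bucket name and key scheme are hypothetical, and the device would POST its readings to this function through something like API Gateway.

```python
import json
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "my-iot-data"  # hypothetical bucket


def lambda_handler(event, context):
    """Write the POSTed IoT payload straight to S3, keyed by arrival time."""
    body = event.get("body") or "{}"
    key = f"readings/{int(time.time())}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```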
Option 2 with Kinesis Data Firehose has the least administrative overhead.
You may also want to look into the native IoT services. It may be possible to use IoT Core and put the data directly in S3.
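For example, an IoT Core topic rule can copy incoming messages straight into an S3 bucket. The rule name, topic filter, bucket, and role below are hypothetical; this is only a sketch of what the create_topic_rule call might look like.

```python
import boto3

iot = boto3.client("iot")

# Hypothetical rule: write every message published on devices/+/data to S3, one object per message.
iot.create_topic_rule(
    ruleName="iot_to_s3",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/data'",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-s3-write",  # role allowing s3:PutObject
                    "bucketName": "my-iot-data",
                    "key": "${topic()}/${timestamp()}.json",
                }
            }
        ],
    },
)
```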
Firehose is fully managed, whereas with Streams you have to manage the shards and consumers yourself.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks.
Amazon Kinesis Data Firehose can send data to:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
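To make that difference concrete, here is a rough sketch of the kind of consumer you would have to write yourself with Kinesis Data Streams to land records in S3. It reads a single shard with no checkpointing or error handling; the stream and bucket names are hypothetical.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

STREAM = "events"         # hypothetical stream
BUCKET = "my-event-dump"  # hypothetical bucket

# Start reading the first shard from the oldest available record.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
    if resp["Records"]:
        # Batch the records into one S3 object, keyed by arrival time.
        body = b"\n".join(r["Data"] for r in resp["Records"])
        s3.put_object(Bucket=BUCKET, Key=f"events/{int(time.time())}.ndjson", Body=body)
    iterator = resp["NextShardIterator"]
    time.sleep(1)
```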
I am trying to implement an event-driven architecture using Amazon Kinesis as the central event log of the platform. The idea is pretty much the same as the one presented by Nordstrom with the Hello-Retail project.
I have done similar things with Apache Kafka before, but Kinesis seems to be a cost-effective alternative to Kafka and I decided to give it a shot. I am, however, facing some challenges related to event persistence and replaying. I have two questions:
Are you guys using Kinesis for such a use case, or do you recommend using it?
Since Kinesis is not able to retain events forever (like Kafka does), how do you handle replays from consumers?
I'm currently using a Lambda function (Firehose is also an option) to persist all events to Amazon S3. Then, one could read past events from storage and then start listening to new events coming from the stream. But I'm not happy with this solution. Consumers are not able to use Kinesis' checkpoints (Kafka's consumer offsets). Plus, Java's KCL does not support AFTER_SEQUENCE_NUMBER yet, which would be useful in such an implementation.
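To show the shape of that workaround (not a recommendation), a consumer might first replay the S3 archive and then switch to the live stream. The bucket, prefix, and stream names are hypothetical, and the archive is assumed to be newline-delimited events in key order.

```python
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

BUCKET, PREFIX, STREAM = "event-archive", "events/", "platform-events"  # hypothetical names


def handle(event_bytes):
    pass  # placeholder for the consumer's actual event handling


# 1. Replay: read every archived event object from S3 in key order.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for line in body.splitlines():
            handle(line)

# 2. Switch to live events: start reading the stream from LATEST.
# NOTE: events arriving between the end of the replay and this point can be missed,
# which is part of why this approach is unsatisfying without proper checkpoints.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]
while True:
    resp = kinesis.get_records(ShardIterator=iterator)
    for record in resp["Records"]:
        handle(record["Data"])
    iterator = resp["NextShardIterator"]
```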
First question: yes, I am using Kinesis Streams when I need to process the received log/event data before storing it in S3. When I don't, I use Kinesis Firehose.
Second question: Kinesis Streams can store data for up to seven days. This is not forever, but it should be enough time to process your events. Depending on the value of the events being processed ...
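As a side note, a stream's retention can be raised from the 24-hour default toward that seven-day limit; a minimal sketch, assuming a hypothetical stream named platform-events:

```python
import boto3

kinesis = boto3.client("kinesis")

# Raise retention from the 24-hour default to 7 days (168 hours); the value is given in hours.
kinesis.increase_stream_retention_period(
    StreamName="platform-events",  # hypothetical stream name
    RetentionPeriodHours=168,
)
```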
If I do not need to process the event stream before storing in S3, then I use Kinesis Firehose writing to S3. Now I do not have to worry about event failures, persistence, etc. I then process the data stored in S3 with the best tool. I use Amazon Athena often and Amazon Redshift too.
You don't mention how much data you are processing or how it is being processed. If it is large, multiple MB / sec or higher, then I would definitely use Kinesis Firehose. You have to manage performance with Kinesis Streams.
One issue that I have with Kinesis Streams is that I don't like the client libraries, so I prefer to write everything myself. Kinesis Firehose reduces coding for custom applications as you just store the data in S3 and then process afterwards.
I like to think of S3 as my big data lake. I prefer to throw everything into S3 without preprocessing and then use various tools to pull out the data that I need. By doing this I remove lots of points of failure that need to be managed.
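As an illustration of that "dump first, query later" approach, running an Athena query over data landed in S3 can be as simple as the sketch below. The database, table, and result bucket are hypothetical, and the table is assumed to already be defined over the S3 prefix (for example via Glue or a CREATE EXTERNAL TABLE statement).

```python
import boto3

athena = boto3.client("athena")

# Hypothetical: an 'events' table is already defined over the raw data sitting in S3.
response = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) AS n FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution with this id to check completion
```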
Background
I have found that Amazon Kinesis Data Analytics can be used for streaming data as well as data present in an S3 bucket.
However, there are some parts of the Kinesis documentation that make me question whether Amazon Kinesis Analytics can be used for a huge amount of existing data in an S3 bucket:
Authoring Application Code
We recommend the following:
In your SQL statement, don't specify a time-based window that is longer than one hour for the following reasons:
Sometimes an application needs to be restarted, either because you updated the application or for Kinesis Data Analytics internal reasons. When it restarts, all data included in the window must be read again from the streaming data source. This takes time before Kinesis Data Analytics can emit output for that window.
Kinesis Data Analytics must maintain everything related to the application's state, including relevant data, for the duration. This consumes significant Kinesis Data Analytics processing units.
Question
Will Amazon Kinesis Analytics be good for this task?
The primary use case for Amazon Kinesis Analytics is stream data processing. For this reason, you attach an Amazon Kinesis Analytics application to a streaming data source. You can optionally include reference data from S3, which is limited in size to 1 GB at this time. We will load data from an S3 object into a SQL table that you can use to enrich the incoming stream.
It sounds like you want a more general-purpose tool for querying data from S3, not a stream data processing solution. I would recommend looking at Presto and Amazon EMR instead of using Amazon Kinesis Analytics.
Disclaimer: I work for the Amazon Kinesis team.