I have different data sources and I need to publish them to S3 in real time. I also need to process and validate the data before delivering it to the S3 buckets. I know that AWS Kinesis Data Streams offers real-time data streaming, and that I can process the data using AWS Lambda before sending it to S3. However, it is not clear to me whether we can use AWS Glue Streaming instead of AWS Kinesis Data Streams plus AWS Lambda. I have seen some documentation about using AWS Glue Streaming to process real-time data on the fly and send it to S3. So, what are the real differences here? Is AWS Glue Streaming ETL a good choice for streaming and processing data in real time and storing it in S3?
A Kinesis data stream with a Lambda consumer will fit as long as the Lambda execution environment limits are sufficient:
- 15 minutes maximum execution time
- memory configuration limits
- concurrency limits
When going with a Glue consumer, your Glue jobs can run longer and also support Apache Spark for massively parallel processing.
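For context, a Glue streaming job is essentially a long-running Spark Structured Streaming script that Glue micro-batches for you. A rough sketch, loosely based on the Glue streaming examples in the AWS docs (the stream ARN, the S3 paths, and the `event` validation field are all placeholders, and option names can vary by Glue version):

```python
# Rough sketch of a Glue streaming ETL job: read from Kinesis, validate,
# write to S3. ARNs, paths, and the 'event' field are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Kinesis stream as a streaming DataFrame.
stream_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/example",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
)

def process_batch(data_frame, batch_id):
    if data_frame.count() == 0:
        return
    # Hypothetical validation: keep only records carrying an 'event' field.
    valid = DynamicFrame.fromDF(
        data_frame.filter(data_frame["event"].isNotNull()), glue_context, "valid"
    )
    glue_context.write_dynamic_frame.from_options(
        frame=valid,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/output/"},
        format="json",
    )

# Glue micro-batches the stream and calls process_batch on each window.
glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://example-bucket/checkpoints/",
    },
)
job.commit()
```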
You can also use Kinesis Firehose, which has native integrations to deliver data to S3, Elasticsearch Service, etc., and which requires no code at all when the data needs no changes. You can also have Firehose invoke a Lambda function to do minimal processing on the records before delivery.
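For reference, a Firehose transformation Lambda follows a fixed contract: it receives a batch of base64-encoded records and must return each one with the same `recordId`, a result status, and the (possibly modified) data. A minimal sketch that simply validates that each record is JSON:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: return 'Ok' for valid JSON
    records and flag everything else as 'ProcessingFailed'."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        try:
            parsed = json.loads(payload)
            # Re-encode the (optionally modified) record for delivery.
            data = base64.b64encode(json.dumps(parsed).encode()).decode()
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": data})
        except json.JSONDecodeError:
            # Firehose treats these as processing failures.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```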
I want to build a use case where I do real-time analytics. I am not sure when it is necessary to put Kinesis Data Streams in front of Kinesis Firehose. The documentation says that Kinesis Firehose can get its data from Kinesis Data Streams, but the use cases are not clear.
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
So the benefit of having Kinesis Firehose receive data from Kinesis Data Streams is that Firehose integrates directly with the following services: S3, Redshift, Elasticsearch Service, Splunk.
If you want your streamed data delivered to any of those endpoints, you can pass it to Firehose and have it do the work for you.
Traditionally you'd write your own consumer, which would be another piece of code to develop and to maintain when it breaks. Using Firehose, you can rely on AWS to do this part for you.
I'm using a Firehose delivery stream to write JSONs to S3. These JSONs represent calls; the stream will often receive a new version of a JSON that brings new info about the call it represents.
I would like Firehose to write each JSON record to a separate S3 object, rather than grouping them together as it does by default. Each JSON would be written at an S3 key that identifies the call, so that when a new version of a JSON shows up, Firehose replaces its previous version in S3. Is this possible?
I see that I can set the buffer size that triggers writing to S3, but can I explicitly configure my Firehose stream to write exactly one S3 object per record?
There's no Redshift involved.
This is not possible with Amazon Kinesis Data Firehose. It is a simplified service that only has a few configuration options.
Instead, you could use Amazon Kinesis Data Streams:
- Send data to the stream
- Create an AWS Lambda function that is triggered whenever data is received by the stream
- Code the Lambda function to write the data to the appropriate Amazon S3 object (see the sketch below)
See: Using AWS Lambda with Amazon Kinesis - AWS Lambda
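A minimal sketch of such a Lambda function, assuming each record is a JSON document with a hypothetical `call_id` field (the bucket name is a placeholder too):

```python
import base64
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-calls-bucket"  # placeholder bucket name

def lambda_handler(event, context):
    """Triggered by a Kinesis stream; writes each record to an S3 key
    derived from the call's identifier, so newer versions overwrite
    older ones."""
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        call = json.loads(payload)
        # 'call_id' is a hypothetical field identifying the call.
        key = f"calls/{call['call_id']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
```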
Firehose is fully managed, whereas with Streams you manage the shards and consumer applications yourself.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks.
Amazon Kinesis Data Firehose can send data to:
- Amazon S3
- Amazon Redshift
- Amazon Elasticsearch Service
- Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
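To give a sense of what "write your own code" means, here is a bare-bones polling consumer using boto3. This is a sketch only: it reads a single shard from the latest position and skips the checkpointing, error handling, and resharding logic that the Kinesis Client Library would normally handle for you (the stream name is a placeholder):

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "example-stream"  # placeholder

# Read one shard only; a real consumer (e.g. the KCL) tracks all shards
# and checkpoints its progress.
shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["Data"])  # connect to your destination here
    iterator = response["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```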
Kinesis Firehose, as well as Kinesis Streams, is used to load streaming data, as per the details mentioned in the AWS blogs. There is no concept of shards or maintenance in the case of Firehose. In that case, is Kinesis Firehose a replacement for Kinesis Streams?
Amazon Kinesis Firehose is an easy way to create a stream where data is sent to one of:
- Amazon S3
- Amazon Redshift
- Amazon Elasticsearch Service
You can also create a Lambda function that can manipulate the data on the way through.
If the above suits your needs, then Firehose could be considered a replacement for Kinesis Streams. However, Kinesis Streams offers more flexibility so it is not an exact replacement.
Kinesis Firehose is not a replacement for Kinesis Streams, although there are several use cases that Kinesis Firehose has taken over since its introduction.
Kinesis Streams is used to buffer streaming data from producers and stream it into custom applications for data processing and analysis; those applications consume the temporarily buffered stream data.
Data producers push data to Kinesis Streams -> Applications read the data from stream and process.
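On the producer side, pushing a record into the stream is a single API call; a minimal boto3 sketch (the stream name and payload are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Push one record into the stream; the partition key determines the shard.
kinesis.put_record(
    StreamName="example-stream",  # placeholder
    Data=json.dumps({"event": "demo"}).encode(),
    PartitionKey="demo-key",
)
```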
Kinesis Firehose is used to capture and load streaming data into other Amazon services such as S3 and Redshift so that analysis can take place later on.
Data producers push data to Kinesis Firehose -> Data Transformation using Lambda -> Store in S3 or Redshift.
The two can also be used in combination: Kinesis Streams can stream the data into Kinesis Firehose so that it is persisted after processing.
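Wiring up that combination is a one-time configuration on the Firehose side. A rough boto3 sketch, assuming the Kinesis stream, IAM role, and S3 bucket already exist (all ARNs are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that reads from an existing Kinesis stream
# and persists its records to S3. All ARNs below are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/example",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    S3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
    },
)
```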
One thing to take into account when choosing which service to use is the limits and scalability of each solution.
AWS Firehose has a default limit of 5 MB/sec or 5,000 records/sec (details here), although it can be increased by contacting AWS through a request form.
On the other hand, AWS Kinesis can be scaled easily by increasing the number of shards in each stream (up to 500 shards by default). The main issue here is that each shard has its own cost, and you can only scale up or down within a factor of two of the current shard count per operation.
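Resharding itself is one API call; for example, doubling the shard count with boto3 (the stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count (UpdateShardCount limits the target to between
# half and double the current count per call).
current = kinesis.describe_stream_summary(StreamName="example-stream")[
    "StreamDescriptionSummary"]["OpenShardCount"]
kinesis.update_shard_count(
    StreamName="example-stream",  # placeholder
    TargetShardCount=current * 2,
    ScalingType="UNIFORM_SCALING",
)
```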
As Ashan said, these services serve different purposes, but you can use each one on its own or combine them according to your needs. The main advantage of Kinesis Streams is that a stream can be consumed by many consumers and fed by many producers. A Firehose stream, on the other hand, acts as a consumer for another source of data (such as a Kinesis Stream) and can output data to only one destination (S3, Redshift, Elasticsearch, Splunk).
Not sure how it would be a replacement, since there is no persistence of data with Kinesis Firehose, unless you mean it in a context where there is no need for data persistence, or where cost is the issue. In that case your option would be to analyze the data as soon as it comes in, which is what Kinesis Firehose does, eventually storing it in S3 or an Elasticsearch cluster.
No, just different purposes.
With Kinesis Streams, you build applications that use the Kinesis Producer Library to put data into a stream, process it with an application that uses the Kinesis Client Library, and send the processed data to S3, Redshift, DynamoDB, or Elasticsearch with the Kinesis Connector Library.
With Kinesis Firehose it's a bit simpler: you create the delivery stream and send the data to S3, Redshift, or Elasticsearch directly (using the Kinesis Agent or the API), and it is stored in those services.
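"Using the API" here amounts to a single PutRecord call against the delivery stream; a minimal boto3 sketch (the stream name and payload are placeholders):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Send one record to the delivery stream; Firehose buffers and delivers it.
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",  # placeholder
    Record={"Data": json.dumps({"event": "demo"}).encode()},
)
```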
Kinesis Streams, on the other hand, can store the data for up to 7 days.
You may use Kinesis Streams if you want to do custom processing of streaming data. With Kinesis Firehose you are simply ingesting the data into S3, Redshift, or Elasticsearch.