I have a use-case in which I need to make an API call for the payload sent to my Kinesis Firehose stream before storing it in S3.
The flow would be: Kinesis Data Stream -> Kinesis Firehose -> Transformation Lambda -> API call to get additional data relating to current records -> Kinesis Firehose -> S3.
Basically, for a record that is consumed by my Kinesis Firehose stream, I need to call another backend service to get additional data related to the record before storing in S3 for our EMR jobs to consume and write queries on.
My question is: is it possible to make network calls from a Kinesis Firehose transformation Lambda? I think it should be, since it's just another Lambda function. I would also like to understand whether it's against best practices to make API calls in a Kinesis Firehose transformation Lambda.
Any insight is appreciated!
I don't think there is any problem with making network calls in a transformation Lambda function. The only thing you need to make sure of is that, after transformation, you return every recordId back to Firehose in the format Firehose accepts, as shown in the AWS docs.
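To illustrate the point about returning every recordId, here is a minimal sketch of a transformation handler. It assumes each record carries a JSON payload, and `fetch_extra` is a hypothetical stand-in for the backend API call (it is not part of any AWS library). The response shape (`recordId`, `result`, `data`) follows the Firehose data transformation contract.

```python
import base64
import json


def fetch_extra(payload):
    """Hypothetical placeholder for the backend call; swap in a real HTTP request with a timeout."""
    return {"enriched": True}


def handler(event, context, fetch=fetch_extra):
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            payload.update(fetch(payload))
            line = json.dumps(payload) + "\n"  # newline-delimited JSON for later queries
            data = base64.b64encode(line.encode("utf-8")).decode("ascii")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except Exception:
            # Still return the recordId, marked as failed, with the original data.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    # Every incoming recordId appears exactly once in the response.
    return {"records": output}
```

Because the API call runs once per record here, a slow backend multiplies across the batch, so a short timeout or a batched lookup is probably worth considering.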
I have an event stream which sends millions of events through SNS every day. Through a Lambda, these events are then stored in S3, but each in its own file. The total size of these events is not much (less than 1 GB), but moving or deleting one day's worth of files, each only a few bytes in size, becomes a long process. Is there a way I can store these SNS events in larger files (or even a single file)?
I'd have the Lambda write the events to Kinesis Data Firehose and use that to batch the events up to a certain size-threshold or time-window and then have Firehose deliver those to S3.
Here are some resources for that:
S3 Destination for the delivery stream
S3 Destination buffer size & interval
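As a rough sketch of the Lambda side of this suggestion: the function below pulls the message bodies out of an SNS-triggered event and forwards them with `put_record_batch`, in chunks of at most 500 records per call. The `firehose` argument is assumed to be a boto3 Firehose client (for example `boto3.client("firehose")`), and the stream name is whatever you named your delivery stream; both are passed in rather than hard-coded.

```python
def sns_messages(event):
    """Extract the message bodies from an SNS-triggered Lambda event."""
    return [r["Sns"]["Message"] for r in event["Records"]]


def forward(event, firehose, stream_name, batch_size=500):
    """Send the messages to Firehose in chunks; returns the number of failed puts."""
    # A trailing newline keeps events separable once Firehose concatenates them.
    records = [{"Data": (m + "\n").encode("utf-8")} for m in sns_messages(event)]
    failed = 0
    for i in range(0, len(records), batch_size):
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[i:i + batch_size],
        )
        failed += resp.get("FailedPutCount", 0)
    return failed
```

A non-zero return value means some records were rejected and would need a retry; that handling is left out of this sketch.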
SNS with Kinesis Firehose looks like a perfect fit for this use case.
AWS recently announced Kinesis Firehose support for SNS, and on Firehose you can add buffering conditions for delivery to S3.
Kinesis Data Firehose buffers incoming data before delivering it (backing it up) to Amazon S3. You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3
In case you want to transform your events or process them you can use lambda as well.
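The buffering conditions quoted above map onto the `BufferingHints` block of the S3 destination configuration. Here is a small sketch that builds that block and checks the quoted ranges; the helper name and its default values are my own choices, and the ARNs are placeholders you would supply.

```python
def s3_buffering_config(bucket_arn, role_arn, size_mib=64, interval_s=300):
    """Build an S3 destination config with buffering hints, validating the quoted ranges."""
    if not 1 <= size_mib <= 128:
        raise ValueError("buffer size must be between 1 and 128 MiB")
    if not 60 <= interval_s <= 900:
        raise ValueError("buffer interval must be between 60 and 900 seconds")
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Whichever of the two conditions is met first triggers delivery.
        "BufferingHints": {"SizeInMBs": size_mib, "IntervalInSeconds": interval_s},
    }
```

The resulting dictionary is intended to be passed as `ExtendedS3DestinationConfiguration` to the boto3 Firehose client's `create_delivery_stream` call.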
I have events that keep coming in, which I need to put into S3. I am trying to evaluate whether I must use Kinesis Data Streams or Firehose. I also want to wait for a few minutes before writing to S3 so that the object is fairly full.
Based on my reading about Kinesis Data Streams, I have to create an analytics app which will then be used to invoke a Lambda. I will then have to use the Lambda to write to S3. Or can Kinesis Data Streams write directly to Lambda somehow? I could not find anything indicating this.
Firehose is not charged by the hour (while Data Streams is). So is Firehose a better option for me?
Or Kinesis Data Streams can directly write to lambda somehow?
Data Streams can't write directly to S3. Instead Firehose can do this:
delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, MongoDB, and New Relic.
What's more Firehose allows you to buffer the records before writing them to S3. The writing can happen based on buffer size or time. In addition to that you can process the records using lambda function before writing to S3.
Thus, on balance, it seems that Firehose is better suited to your use case than Data Streams.
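To make the "buffer size or time" behaviour concrete, here is a toy simulation of a size-or-age buffer. It is not how Firehose is implemented; it only illustrates the rule that whichever threshold is reached first triggers the write. The class name is hypothetical, and unlike the real service this sketch only checks the age when a new record arrives.

```python
class SizeOrTimeBuffer:
    """Toy buffer that flushes when total size or age crosses a threshold."""

    def __init__(self, max_bytes, max_age_s):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.items = []
        self.size = 0
        self.started = None  # time the first record of the current batch arrived

    def add(self, data, now):
        """Add a record; return the flushed batch if a threshold was hit, else None."""
        if self.started is None:
            self.started = now
        self.items.append(data)
        self.size += len(data)
        if self.size >= self.max_bytes or now - self.started >= self.max_age_s:
            return self.flush()
        return None

    def flush(self):
        """Return the current batch and reset the buffer."""
        batch, self.items, self.size, self.started = self.items, [], 0, None
        return batch
```

With a few-minute interval and a generous size limit, small events end up grouped into one object per interval, which is the "fairly full" object the question asks for.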
I want to build a use case where I want to do real time analytics. I am not sure when it is necessary to use Kinesis Data Streams before Kinesis Firehose. In the documentation it says that Kinesis Firehose can get the data from Kinesis Data Streams but the use cases are not clear.
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
So the benefit of using Kinesis Firehose to have data passed from Kinesis Data Streams is that it integrates directly with the following services: S3, Redshift, ElasticSearch Service, Splunk.
If you want your streamed data to be delivered to any of those endpoints by passing to Firehose you can have it do the work for you.
Traditionally you'd write your own consumer which would be another piece of code to develop and maintain if it breaks. But using Firehose you can rely on AWS to do this part for you.
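For a sense of what "writing your own consumer" involves, here is a bare-bones polling loop for a single shard. The `kinesis` argument is assumed to be a boto3 Kinesis client, and `handle` is a hypothetical callback for each record's data. The sketch deliberately omits checkpointing, retries, throttling back-off and resharding, which is the kind of work Firehose (or the Kinesis Client Library) takes off your hands.

```python
def read_shard(kinesis, stream_name, shard_id, handle, max_polls=10):
    """Poll one shard from its oldest record and pass each record's data to handle()."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
    )["ShardIterator"]
    seen = 0
    for _ in range(max_polls):
        if iterator is None:  # the shard has been closed and fully read
            break
        resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in resp["Records"]:
            handle(record["Data"])
            seen += 1
        iterator = resp.get("NextShardIterator")
    return seen
```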
Kinesis Firehose, as well as Kinesis Streams, are used to load streaming data as per the details mentioned in the AWS blogs. There is no concept of shards or maintenance in the case of Firehose. In that case, is Kinesis Firehose a replacement for Kinesis Streams?
Amazon Kinesis Firehose is an easy way to create a stream where data is sent to one of:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
You can also create a Lambda function that can manipulate the data on the way through.
If the above suits your needs, then Firehose could be considered a replacement for Kinesis Streams. However, Kinesis Streams offers more flexibility so it is not an exact replacement.
Kinesis Firehose is not a replacement for Kinesis Streams, although it has taken over several use cases since its introduction.
Kinesis Streams is used to buffer streaming data from producers and stream it into custom applications for data processing and analysis, which consume the temporarily buffered stream data.
Data producers push data to Kinesis Streams -> Applications read the data from stream and process.
Kinesis Firehose is used to capture and load streaming data into other Amazon services such as S3 and Redshift so that analysis can take place later on.
Data producers push data to Kinesis Firehose -> Data Transformation using Lambda -> Store in S3 or Redshift.
The two can also be used in combination: Kinesis Streams can stream the data into Kinesis Firehose so that it can be persisted after processing.
One thing to take into account when choosing which service to use is the limits and scalability of each solution.
AWS Firehose has a fixed limit of 5 MB/sec or 5,000 records/sec (details here), although it can be increased by contacting AWS through a request form.
On the other hand, AWS Kinesis can be scaled easily by increasing the number of shards for each Stream (up to 500 shards by default). The main issue here is that each shard has its own cost and you can only scale up or down by doubling the current amount of shards.
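As a back-of-the-envelope aid for the shard sizing mentioned here, the sketch below estimates how many shards a given write load needs. It assumes the documented per-shard write limits of 1 MB/sec and 1,000 records/sec, which are not stated in this thread, so check the current quotas before relying on the numbers. The function name is my own.

```python
import math


def shards_needed(mb_per_sec, records_per_sec,
                  shard_mb_per_sec=1.0, shard_records_per_sec=1000):
    """Estimate the shard count for a write load (per-shard limits are assumptions)."""
    by_bytes = math.ceil(mb_per_sec / shard_mb_per_sec)
    by_count = math.ceil(records_per_sec / shard_records_per_sec)
    # A stream needs at least one shard; the tighter of the two limits decides.
    return max(1, by_bytes, by_count)
```

For example, a load equal to the Firehose limit quoted above (5 MB/sec or 5,000 records/sec) would need about five shards under these assumptions, each billed separately.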
As Ashan said, these services serve different purposes, but you can use each one on its own, or combine them according to your needs. The main advantage here is that a Kinesis Stream can be consumed by many consumers and fed by many producers. On the other hand, Firehose streams act as a consumer for another source of data (such as a Kinesis Stream) and can output data to only one destination (S3, Redshift, Elasticsearch, Splunk).
I am not sure how it would be a replacement, since there is no persistence of data with Kinesis Firehose. Perhaps you mean the case where data persistence is not needed, or where cost is the issue; then your option would be to process the data as soon as it comes in, which is what Kinesis Firehose does, and eventually store it in S3 or an Elasticsearch cluster.
No, just different purposes.
With Kinesis Streams, you build applications that use the Kinesis Producer Library to put data into a stream, process it with an application that uses the Kinesis Client Library, and use the Kinesis Connector Library to send the processed data to S3, Redshift, DynamoDB or Elasticsearch.
With Kinesis Firehose it's a bit simpler: you create the delivery stream and send the data (using the Kinesis Agent or the API) directly to S3, Redshift or Elasticsearch, where it is stored.
Kinesis Streams, on the other hand, can store the data for up to 7 days.
You may use Kinesis Streams if you want to do some custom processing with streaming data. With Kinesis Firehose you are simply ingesting it into S3, Redshift or Elasticsearch.