We have a team writing data into into Kinesis Stream on their account (Acc A). I want to consume this data in my account (Acc B) ideally using Kinesis Firehose and just moving that data into S3.
I went through pages and pages of AWS documentation but I haven't found an answer. Is it possible to consume data from Kinesis Stream using Firehose which is in a different AWS Account. And if yes, then how and if not, what would be the most straight forward solution to get this data and move it to S3?
Bear in mind that the owners of the Stream refuse to do any changes in their architecture.
Thank you for any help.
I would recommend to convice the source account owner to create an kinesis delivery stream on their side to push data into your S3 bucket. This is much easier.
Never the less, if this is not an option you can create a Kinesis Data Analytics Stream to consume data from another account which you then can pump into S3 using a Kinesis Delivery Stream.
A similiar approach is explained here: https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html
Having a Kinesis Data Stream connecting directly to Kinesis Firehose in a different account is right now not possible. But, you could for example use a lambda function to transfer data between cross account - This is a setup I am using and it works fine. This could be quite expensive depending on the amount of data.
There is a tutorial available here - only difference to your requirement is that this lambda is based in account A.
Related
I have events that keep coming which I need to put to S3. I am trying to evaluate if I muse use Kinesis Stream or Firehose. I also want to wait for few minutes before writing to S3 so that the object is fairly full.
Based on my reading of Kinesis Data stream, I have to create an analytics app which will then be used to invoke a lambda. I will then have to use the lambda to write to S3. Or Kinesis Data Streams can directly write to lambda somehow? I could not find anything indicating the same.
Firehose is not charged by hour(while stream is). So is firehose a better option for me?
Or Kinesis Data Streams can directly write to lambda somehow?
Data Streams can't write directly to S3. Instead Firehose can do this:
delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, MongoDB, and New Relic.
What's more Firehose allows you to buffer the records before writing them to S3. The writing can happen based on buffer size or time. In addition to that you can process the records using lambda function before writing to S3.
Thus, colectively it seems that Firehose is more suited to your use-case then Data Streams.
I am working on an application that will read and analyze the logs of payment transactions. I know I will use Kinesis Analytics as per my requirements, which takes the input from the Data Streams and Firehose. But I am having trouble deciding which input method should I use for my system. My requirements are:
It can tolerate latency, but Data shouldn't lose data.
Must record all the errors in DynamoDB or S3 buckets.
Which input stream is suitable for my use case?
Data Streams vs Firehose
Streams:
Kinesis data streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs.
Going to write custom code
Real time (200ms latency for classic, 70ms latency for enhanced fan-out)
You must manage scaling (shard splitting/merging)
Data storage for 1 to 7 days, replay capability, multi consumers
Use with Lambda to insert data in real-time to ElasticSearch
Firehose:
Firehose handles loading data streams directly into AWS products for processing.
Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
Kinesis Data Streams allows consumers to READ streaming data. And it gives you a plenty of options to do so. It is best suitable for use cases that require custom processing, choice of stream processing frameworks, and sub-second processing latency.
Data is reliably stored in streams up to 7 days and distributed across 3 Availability Zones.
Kinesis Firehose is used to LOAD streaming data to a target destination (S3, Elasticsearch, Splunk, etc). You can also transform streaming data (by using Lambda) before loading it to destination.
Data from failed attempts will be saved to S3.
So, if your goal is to only load data to Kinesis Data Analytics service with minimal or no pre-processing then try Kinesis Firehose first.
Please note, that you also would need to consider such aspects as cost, development efforts, scaling options, volume of the data when choosing a proper service.
Please take a look at the following AWS Solutions Implementation for reference:
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
https://aws.amazon.com/solutions/implementations/real-time-iot-device-monitoring-with-kinesis/
There are some key differences between Kinesis Stream (KS) and Firehose (FH):
KS is real time, while FH is near-real time.
KS requires manual scaling and setup of its provisioning (shards) , while FH is basically serverless.
KS records are immutable (they persist in stream for its retention period - default 24h), while records in FH are gone from FH the moment they are delivered to destination.
From what you wrote, I think FH should be considered first, as you are not concerned about non-real-time nature of FH, it is much easier to manage and setup, and you can specify S3 as a backup for failed or all messages:
Kinesis Data Firehose uses Amazon S3 to backup all or failed only data that it attempts to deliver to your chosen destination.
The S3 backup ensures you are not loosing records, if delivery or lambda processing fail. Subsequently, in my view, Firehose addresses your two points well.
You can use firehose to feed into analytics, but question is how firehose gets data? You can write your own code to feed data or use kinesis data steams. Firehose mainly is delivery system for stream data that can be written in to various destinations such as S3, Redshift or others with optional capability to perform data transformation.
Check this link https://www.slideshare.net/AmazonWebServices/abd217from-batch-to-streaming?from_action=save and see how your use case can benefit from the information.
More info: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
If you are creating s3 files from the kinesis stream but you dont require cleaning of those s3 files then go with the firehose option. Also if you dont have any partitioning key requirement that makes many small s3 files then firehose is a good solution. If you are doing more cleaning up the FH files than you would have created those s3 files yourself then FH isnt a good option.
Also depends on what do you with those s3 files. You need to find out if you are saving any work/money because of using Firehose against the manual creation of S3 files. Remember you cant reorder the content of the s3 files.
I want to build a use case where I want to do real time analytics. I am not sure when it is necessary to use Kinesis Data Streams before Kinesis Firehose. In the documentation it says that Kinesis Firehose can get the data from Kinesis Data Streams but the use cases are not clear.
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
So the benefit of using Kinesis Firehose to have data passed from Kinesis Data Streams is that it integrates directly with the following services: S3, Redshift, ElasticSearch Service, Splunk.
If you want your streamed data to be delivered to any of those endpoints by passing to Firehose you can have it do the work for you.
Traditionally you'd write your own consumer which would be another piece of code to develop and maintain if it breaks. But using Firehose you can rely on AWS to do this part for you.
Firehose is fully managed whereas Streams is manually managed.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks..
Amazon Kinesis Data Firehose can send data to:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
Kinesis Firehose, as well as Kinesis Streams, are used to load streaming data as per the details mentioned in the AWS blogs. There is no concept of shards or maintenance in case of Firehose. In such a case, Is Kinesis Firehose a replacement to Kinesis Streams?
Amazon Kinesis Firehose is an easy way to create a stream where data is sent to one of:
Amazon S3
Amazon Redshift
Amazon Elasticache
You can also create a Lambda function that can manipulate the data on the way through.
If the above suits your needs, then Firehose could be considered a replacement for Kinesis Streams. However, Kinesis Streams offers more flexibility so it is not an exact replacement.
Kinesis Firehose is not a replacement to Kinesis Streams although there are several use cases, Kinesis Firehose has taken over after its introduction.
Kinesis Streams is used to buffer the streaming data from producers and streaming it into custom applications for data processing and analysis which will consume the temporary buffered stream data.
Data producers push data to Kinesis Streams -> Applications read the data from stream and process.
Kinesis Firehose is used to capture and load streaming data into other Amazon services such as S3 and Redshift so that analysis can take place later on.
Data producers push data to Kinesis Firehose -> Data Transformation using Lambda -> Store in S3 or Redshift.
These two can also be used in combination where, Kinesis Streams can stream the data in to Kinesis Firehose so that, it could be persisted after processing.
A thing to take into account when choosing which service to use are the limits and scalability of each solution.
AWS Firehose has a fixed limit of 5mb/sec or 5000 rec/sec (details here), although it can be increased by contacting AWS through a request form.
On the other hand, AWS Kinesis can be scaled easily by increasing the number of shards for each Stream (up to 500 shards by default). The main issue here is that each shard has its own cost and you can only scale up or down by doubling the current amount of shards.
As Ashan said, these services serve different purposes, but you can use each one on its own, or combine them according to your needs. The main advantage here, is that Kinesis Stream can be consumed by many consumers, and be fed by many producers. On the other hand, Firehose Streams act as a consumer for other source of data (such as a Kinesis Stream) and can output data to only one destination (S3, Redshit, Elasticsearch, Splunk).
Not sure how it would be a replacement if there is no persistence of data with Kinesis Firehose, unless you mean it in the context of there is no need for data persistence or perhaps its an issue of cost, then your option would be to analyze that data as soon as it comes in which is Kinesis Firehose and eventually storing it in S3 or ElasticSearch Cluster.
No, just different purposes.
With Kinesis Streams, you build applications using the Kinesis Producer Library put the data into a stream and then process it with an application that uses the Kinesis Client Library and with Kinesis Connector Library send the processed data to S3, Redshift, DynamoDB or ElasticSearch.
With Kinesis Firehose it’s a bit simpler where you create the delivery stream and send the data to S3, Redshift or ElasticSearch (using the Kinesis Agent or API) directly and storing it in those services.
Kinesis Streams, on the other hand, can store the data for up to 7 days.
You may use Kinesis Streams if you want to do some custom processing with streaming data. With Kinesis Firehose you are simply ingesting it into S3, Redshift, DynamoDB or ElasticSearch.