I'd like to fanout/chain/replicate an Input AWS Kinesis stream To N new Kinesis streams, So that each record written to the input Kinesis will appear in each of the N streams.
Is there an AWS service or an open source solution?
I prefer not to write code to do that if there's a ready-made solution. AWS Kinesis firehose is a no solution because it can't output to kinesis. Perhaps a AWS Lambda solution if that won't be too expensive to run?
There are two ways you could accomplish fan-out of an Amazon Kinesis stream:
Use Amazon Kinesis Analytics to copy records to additional streams
Trigger an AWS Lambda function to copy records to another stream
Option 1: Using Amazon Kinesis Analytics to fan-out
You can use Amazon Kinesis Analytics to generate a new stream from an existing stream.
From the Amazon Kinesis Analytics documentation:
Amazon Kinesis Analytics applications continuously read and process streaming data in real-time. You write application code using SQL to process the incoming streaming data and produce output. Then, Amazon Kinesis Analytics writes the output to a configured destination.
Fan-out is mentioned in the Application Code section:
You can also write SQL queries that run independent of each other. For example, you can write two SQL statements that query the same in-application stream, but send output into different in-applications streams.
I managed to implement this as follows:
Created three streams: input, output1, output2
Created two Amazon Kinesis Analytics applications: copy1, copy2
The Amazon Kinesis Analytics SQL application looks like this:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM"
(log VARCHAR(16));
CREATE OR REPLACE PUMP "COPY_PUMP1" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "log" FROM "SOURCE_SQL_STREAM_001";
This code creates a pump (think of it as a continual select statement) that selects from the input stream and outputs to the output1 stream. I created another identical application that outputs to the output2 stream.
To test, I sent data to the input stream:
#!/usr/bin/env python
import json, time
from boto import kinesis
kinesis = kinesis.connect_to_region("us-west-2")
i = 0
while True:
data={}
data['log'] = 'Record ' + str(i)
i += 1
print data
kinesis.put_record("input", json.dumps(data), "key")
time.sleep(2)
I let it run for a while, then displayed the output using this code:
from boto import kinesis
kinesis = kinesis.connect_to_region("us-west-2")
iterator = kinesis.get_shard_iterator('output1', 'shardId-000000000000', 'TRIM_HORIZON')['ShardIterator']
records = kinesis.get_records(iterator, 5)
print [r['Data'] for r in records['Records']]
The output was:
[u'{"LOG":"Record 0"}', u'{"LOG":"Record 1"}', u'{"LOG":"Record 2"}', u'{"LOG":"Record 3"}', u'{"LOG":"Record 4"}']
I ran it again for output2 and the identical output was shown.
Option 2: Using AWS Lambda
If you are fanning-out to many streams, a more efficient method might be to create an AWS Lambda function:
Triggered by Amazon Kinesis stream records
That writes records to multiple Amazon Kinesis 'output' streams
You could even have the Lambda function self-discover the output streams based on a naming convention (eg any stream named app-output-*).
There is a github repo from Amazon lab providing the fanout using lambda. https://github.com/awslabs/aws-lambda-fanout . Also read "Transforming a synchronous Lambda invocation into an asynchronous one" on https://medium.com/retailmenot-engineering/building-a-high-throughput-data-pipeline-with-kinesis-lambda-and-dynamodb-7d78e992a02d , which is critical to build a truly asynchronous processing.
There are two AWS native solutions to fanning out Kinesis streams that don't require AWS Firehose or AWS Lambda.
Similar to Kafka consumer groups, Kinesis has the application name. Every consumer to the stream can provide a unique application name. If two consumer has the same application name, then messages are distributed between them. To fan out the stream, provide a different application name for those consumers that you want to receive the same messages from the stream. Kinesis will, under the hood, create new DynamoDB tables to keep track of each consumer for each new application so that they can consume messages at a different rate, etc.
Use Kinesis Enhanced Fan-Out for higher throughput (up to 2MiB per second) and this does not count towards your global read limit. At the time of writing, there is a limit of 20 "enhanced fan-out" consumers per stream.
One caveat as far I am aware with these two options is that you need to use the Kinesis Client Library (KCL) (and not the raw AWS SDK).
Related
I have been Stocked on how to send Lambda logs(Prints) directly to Amazon Kinesis Data Stream. I have Found the way to send Logs from Cloud watch but I would like to send every single prints to kinesis data streams. I have a doubt if I send data from cloud watch does it stream real time prints records to kinesis or not? On this case I would like to use lambda as producer and through the kinesis data S3 as a consumer .
below I have attached a flow work of my conditions.
You can also check the lambda extensions, which helps into direct ingestion of the logs to custom destinations. Its helpful incase you want to avoid cloudwatch costs
https://aws.amazon.com/blogs/compute/using-aws-lambda-extensions-to-send-logs-to-custom-destinations/
You have to create CouldWatch Subscription filter for the Lambda's log stream you want to save to S3. So you would do:
CW Logs subscription ---> Firehose ---> S3
I have events that keep coming which I need to put to S3. I am trying to evaluate if I muse use Kinesis Stream or Firehose. I also want to wait for few minutes before writing to S3 so that the object is fairly full.
Based on my reading of Kinesis Data stream, I have to create an analytics app which will then be used to invoke a lambda. I will then have to use the lambda to write to S3. Or Kinesis Data Streams can directly write to lambda somehow? I could not find anything indicating the same.
Firehose is not charged by hour(while stream is). So is firehose a better option for me?
Or Kinesis Data Streams can directly write to lambda somehow?
Data Streams can't write directly to S3. Instead Firehose can do this:
delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, MongoDB, and New Relic.
What's more Firehose allows you to buffer the records before writing them to S3. The writing can happen based on buffer size or time. In addition to that you can process the records using lambda function before writing to S3.
Thus, colectively it seems that Firehose is more suited to your use-case then Data Streams.
I was assuming I
create a table and enable stream and I now have an ARN
create a kinesis stream
configure somewhere to tell the dynamoDb stream to write to kinesis stream
I was looking at working with https://github.com/harlow/kinesis-consumer but this reads from kinesis or can I use the ARN and use it to read right from the dynamoDB stream?
The more I look, the more I seem to think, I have to write a lambda to read dynamoDB and write to kinesis. Is that correct?
thanks
Hey can you provide a bit more of information about your target setup? do you plan to have some sort of ETL process for your dynamoDB table? AFAIK when you bound a kinesis stream to a dynamodb table, everytime you add, remove or update rows on the dynamodb a new event will be publish in the associated kinesis stream which you can consume from and use the event in whatever way you want.
maybe worth checking this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html
DynamoDB now support Kinesis Data Streams natively:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html
You can choose either DynamoDB Streams or Kinesis Data Streams for your Change Data Capture (CDC).
Properties
Kinesis Data Streams for DynamoDB
DynamoDB Streams
Data retention
Up to 1 year.
24 hours.
Kinesis Client Library (KCL) support
Supports KCL versions 1.X and 2.X.
Supports KCL version 1.X.
Number of consumers
Up to 5 simultaneous consumers per shard, or up to 20 simultaneous consumers per shard with enhanced fan-out.
Up to 2 simultaneous consumers per shard.
Throughput quotas
Unlimited.
Subject to throughput quotas by DynamoDB table and AWS Region.
Record delivery model
Pull model over HTTP using GetRecords and with enhanced fan-out, Kinesis Data Streams pushes the records over HTTP/2 by using SubscribeToShard.
Pull model over HTTP using GetRecords.
Ordering of records
The timestamp attribute on each stream record can be used to identify the actual order in which changes occurred in the DynamoDB table.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
Duplicate records
Duplicate records might occasionally appear in the stream.
No duplicate records appear in the stream.
Stream processing options
Process stream records using AWS Lambda, Kinesis Data Analytics, Kinesis data firehose , or AWS Glue streaming ETL.
Process stream records using AWS Lambda or DynamoDB Streams Kinesis adapter.
Durability level
Availability zones to provide automatic failover without interruption.
Availability zones to provide automatic failover without interruption.
You can use Amazon Kinesis Data Streams to capture changes to Amazon DynamoDB. According to the AWS documentation:
Kinesis Data Streams captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. You can take advantage of longer data retention time—and with enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Other benefits include additional audit and security transparency.
Also You can enable streaming to Kinesis from your DynamoDB table.
Kinesis Firehose, as well as Kinesis Streams, are used to load streaming data as per the details mentioned in the AWS blogs. There is no concept of shards or maintenance in case of Firehose. In such a case, Is Kinesis Firehose a replacement to Kinesis Streams?
Amazon Kinesis Firehose is an easy way to create a stream where data is sent to one of:
Amazon S3
Amazon Redshift
Amazon Elasticache
You can also create a Lambda function that can manipulate the data on the way through.
If the above suits your needs, then Firehose could be considered a replacement for Kinesis Streams. However, Kinesis Streams offers more flexibility so it is not an exact replacement.
Kinesis Firehose is not a replacement to Kinesis Streams although there are several use cases, Kinesis Firehose has taken over after its introduction.
Kinesis Streams is used to buffer the streaming data from producers and streaming it into custom applications for data processing and analysis which will consume the temporary buffered stream data.
Data producers push data to Kinesis Streams -> Applications read the data from stream and process.
Kinesis Firehose is used to capture and load streaming data into other Amazon services such as S3 and Redshift so that analysis can take place later on.
Data producers push data to Kinesis Firehose -> Data Transformation using Lambda -> Store in S3 or Redshift.
These two can also be used in combination where, Kinesis Streams can stream the data in to Kinesis Firehose so that, it could be persisted after processing.
A thing to take into account when choosing which service to use are the limits and scalability of each solution.
AWS Firehose has a fixed limit of 5mb/sec or 5000 rec/sec (details here), although it can be increased by contacting AWS through a request form.
On the other hand, AWS Kinesis can be scaled easily by increasing the number of shards for each Stream (up to 500 shards by default). The main issue here is that each shard has its own cost and you can only scale up or down by doubling the current amount of shards.
As Ashan said, these services serve different purposes, but you can use each one on its own, or combine them according to your needs. The main advantage here, is that Kinesis Stream can be consumed by many consumers, and be fed by many producers. On the other hand, Firehose Streams act as a consumer for other source of data (such as a Kinesis Stream) and can output data to only one destination (S3, Redshit, Elasticsearch, Splunk).
Not sure how it would be a replacement if there is no persistence of data with Kinesis Firehose, unless you mean it in the context of there is no need for data persistence or perhaps its an issue of cost, then your option would be to analyze that data as soon as it comes in which is Kinesis Firehose and eventually storing it in S3 or ElasticSearch Cluster.
No, just different purposes.
With Kinesis Streams, you build applications using the Kinesis Producer Library put the data into a stream and then process it with an application that uses the Kinesis Client Library and with Kinesis Connector Library send the processed data to S3, Redshift, DynamoDB or ElasticSearch.
With Kinesis Firehose it’s a bit simpler where you create the delivery stream and send the data to S3, Redshift or ElasticSearch (using the Kinesis Agent or API) directly and storing it in those services.
Kinesis Streams, on the other hand, can store the data for up to 7 days.
You may use Kinesis Streams if you want to do some custom processing with streaming data. With Kinesis Firehose you are simply ingesting it into S3, Redshift, DynamoDB or ElasticSearch.
As an alternative to resharding which causes much latency how can we dynamically create aws kinesis streams and round-robin-ing to streams?
I have solved this partially by taking the sample stream messages of 2000 from the twitter hosebird- client.
I have created 3 AWS Kinesis Streams each getting 500 messages and calling the next AWS Kinesis Stream by automatically(java - code).
For the last 500 messages the last stream again called the first stream.
Is this the right approach ?
I am trying do this by the concept given in the book : Concepts of OS by Galvin and apply the same for multiple AWS Kinesis Streams.