Is it possible to use enhanced fan-out with DynamoDB Streams? - amazon-web-services

With Kinesis Data Streams it's possible to use the enhanced fan-out feature. The DynamoDB Streams API documentation says:
The DynamoDB Streams API is intentionally similar to that of Kinesis Streams, a service for real-time processing of streaming data at massive scale. In both services, data streams are composed of shards, which are containers for stream records. Both services' APIs contain ListStreams, DescribeStream, GetShards, and GetShardIterator actions. (Even though these DynamoDB Streams actions are similar to their counterparts in Kinesis Streams, they are not 100% identical.)
Is it possible to use enhanced fan-out with DynamoDB Streams? Is there any documentation, or are there code samples, that cover this use case?
It looks like the answer might be no, because enhanced fan-out requires the SubscribeToShard action, which wasn't listed above.

That's right: DynamoDB Streams does not support the fancy enhanced fan-out. Think of Kinesis Data Streams as DynamoDB Streams 2.0, so DynamoDB Streams won't be changing anytime soon.
DynamoDB Streams was designed primarily as an oplog to notify a single consumer (two at most). Given the question, I assume you want multiple consumers with dedicated capacity connecting to the DynamoDB stream?
I would highly recommend using a Lambda function to forward your data to Kinesis, which would give you all of these capabilities.
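A minimal sketch of such a forwarding function, assuming a DynamoDB Streams event source and a hypothetical target stream name passed in via a TARGET_STREAM_NAME environment variable:

```python
import json
import os

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical target stream, injected through the Lambda environment.
TARGET_STREAM = os.environ["TARGET_STREAM_NAME"]


def handler(event, context):
    """Forward a batch of DynamoDB stream records to a Kinesis data stream."""
    records = [
        {
            # The change record (Keys, NewImage, OldImage, ...) as JSON.
            "Data": json.dumps(record["dynamodb"], default=str).encode("utf-8"),
            # Partition by the item's key image so changes to the same item
            # land on the same shard and keep their relative order.
            "PartitionKey": json.dumps(record["dynamodb"]["Keys"], sort_keys=True),
        }
        for record in event["Records"]
    ]
    # PutRecords accepts up to 500 records per call; default Lambda batch
    # sizes for DynamoDB stream sources stay under that.
    kinesis.put_records(StreamName=TARGET_STREAM, Records=records)
```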
However, if you really need to stick with only DynamoDB Streams, then depending on the number of consumers you will need additional code that handles throttling errors gracefully and/or reduces the stream polling frequency.
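If you do go that route, the polling code needs to treat throttling as routine. A rough sketch of what that retry handling could look like with boto3 (the retry budget and sleep schedule here are arbitrary choices, not AWS guidance):

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

streams = boto3.client("dynamodbstreams")


def get_records_with_backoff(shard_iterator, max_attempts=5):
    """Call GetRecords on a DynamoDB stream shard, backing off when throttled."""
    for attempt in range(max_attempts):
        try:
            return streams.get_records(ShardIterator=shard_iterator)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("ThrottlingException", "LimitExceededException"):
                raise
            # Exponential backoff with jitter: ~0.1s, 0.2s, 0.4s, ...
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("GetRecords still throttled after %d attempts" % max_attempts)
```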

Related

What is the difference between Amazon Kinesis data stream and DynamoDB stream details

I am using DynamoDB and I'd like to enable a stream to process any data change in the table. Looking at the stream options, there are two: Amazon Kinesis data stream and DynamoDB stream. From the docs, both handle data changes from the DynamoDB table, but I am not sure what the main difference is between the two.
There are quite a few differences, which are listed in:
Streaming Options for Change Data Capture
A few notable ones are that DynamoDB Streams, unlike Kinesis Data Streams for DynamoDB, guarantees no duplicates; its record retention time is only 24 hours; and there are throughput capacity limits.
Another important difference is that DynamoDB Streams guarantees ordering, while Kinesis Data Streams (associated with a DynamoDB table) does not.
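Since delivery order is not guaranteed on the Kinesis side, a consumer that cares about ordering typically sorts by the record timestamp. A small sketch, assuming the Kinesis record payloads have already been base64-decoded into the JSON change-record format:

```python
import json


def in_creation_order(payloads):
    """Sort decoded DynamoDB change records by their creation timestamp.

    Kinesis Data Streams for DynamoDB does not guarantee delivery order,
    but each change record carries an ApproximateCreationDateTime that can
    be used to approximate the order of the table modifications.
    """
    changes = [json.loads(p) for p in payloads]
    return sorted(changes, key=lambda c: c["dynamodb"]["ApproximateCreationDateTime"])
```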

Multiple destinations for a Kinesis stream at the same time

Can we have multiple destinations from a single Kinesis stream?
I am getting output in Splunk, but now I also want to add an S3 bucket as a destination.
If I add another Amazon Kinesis Data Firehose, will it affect the performance of the Splunk reader? Splunk pulls directly from Kinesis. If I add another destination, will it affect our current reads and writes?
One of the benefits of using Kinesis is that it supports exactly this behaviour.
Each consumer application is responsible for tracking which events it has read from the shard; there is no shared notion between two separate applications of an entry having already been processed.
One recommendation from AWS to bear in mind for high throughput with multiple consumers is to use enhanced fan-out.
Each consumer registered to use enhanced fan-out receives its own read throughput per shard, up to 2 MB/sec, independently of other consumers.
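Registering an enhanced fan-out consumer is a one-off API call; with boto3 it looks roughly like this (the stream ARN and consumer name below are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Placeholder stream ARN and consumer name.
response = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    ConsumerName="splunk-reader",
)

# SubscribeToShard then uses this consumer ARN, and the service pushes
# records over HTTP/2 with a dedicated 2 MB/sec per shard for this consumer.
print(response["Consumer"]["ConsumerARN"])
```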

How to wire a DynamoDB stream to a Kinesis stream?

I was assuming I would:
1. create a table and enable its stream, which gives me an ARN
2. create a Kinesis stream
3. configure something to tell the DynamoDB stream to write to the Kinesis stream
I was looking at working with https://github.com/harlow/kinesis-consumer, but this reads from Kinesis. Or can I use the ARN to read directly from the DynamoDB stream?
The more I look, the more I think I have to write a Lambda to read from the DynamoDB stream and write to Kinesis. Is that correct?
thanks
Hey, can you provide a bit more information about your target setup? Do you plan to have some sort of ETL process for your DynamoDB table? AFAIK, when you bind a Kinesis stream to a DynamoDB table, every time you add, remove, or update rows in DynamoDB a new event will be published to the associated Kinesis stream, which you can consume from and use however you want.
maybe worth checking this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html
DynamoDB now supports Kinesis Data Streams natively:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html
You can choose either DynamoDB Streams or Kinesis Data Streams for your Change Data Capture (CDC).
| Properties | Kinesis Data Streams for DynamoDB | DynamoDB Streams |
| --- | --- | --- |
| Data retention | Up to 1 year. | 24 hours. |
| Kinesis Client Library (KCL) support | Supports KCL versions 1.X and 2.X. | Supports KCL version 1.X. |
| Number of consumers | Up to 5 simultaneous consumers per shard, or up to 20 simultaneous consumers per shard with enhanced fan-out. | Up to 2 simultaneous consumers per shard. |
| Throughput quotas | Unlimited. | Subject to throughput quotas by DynamoDB table and AWS Region. |
| Record delivery model | Pull model over HTTP using GetRecords; with enhanced fan-out, Kinesis Data Streams pushes the records over HTTP/2 by using SubscribeToShard. | Pull model over HTTP using GetRecords. |
| Ordering of records | The timestamp attribute on each stream record can be used to identify the actual order in which changes occurred in the DynamoDB table. | For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item. |
| Duplicate records | Duplicate records might occasionally appear in the stream. | No duplicate records appear in the stream. |
| Stream processing options | Process stream records using AWS Lambda, Kinesis Data Analytics, Kinesis Data Firehose, or AWS Glue streaming ETL. | Process stream records using AWS Lambda or the DynamoDB Streams Kinesis adapter. |
| Durability level | Availability zones to provide automatic failover without interruption. | Availability zones to provide automatic failover without interruption. |
You can use Amazon Kinesis Data Streams to capture changes to Amazon DynamoDB. According to the AWS documentation:
Kinesis Data Streams captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. You can take advantage of longer data retention time, and with enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Other benefits include additional audit and security transparency.
You can also enable streaming to Kinesis from your DynamoDB table.
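For example, with boto3 this is a single call against a hypothetical table and a pre-created Kinesis stream:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Placeholder table name and Kinesis stream ARN; the stream must already exist.
dynamodb.enable_kinesis_streaming_destination(
    TableName="my-table",
    StreamArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
)
```

Once the destination is active, item-level changes flow into the Kinesis stream without any forwarding Lambda.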

Event Sourcing with Kinesis - Replaying and Persistence

I am trying to implement an event-driven architecture using Amazon Kinesis as the central event log of the platform. The idea is pretty much the same as the one presented by Nordstrom with the Hello-Retail project.
I have done similar things with Apache Kafka before, but Kinesis seems to be a cost-effective alternative to Kafka and I decided to give it a shot. I am, however, facing some challenges related to event persistence and replaying. I have two questions:
Are you guys using Kinesis for such use-case OR do you recommend using it?
Since Kinesis is not able to retain the events forever (like Kafka does), how to handle replays from consumers?
I'm currently using a Lambda function (Firehose is also an option) to persist all events to Amazon S3. Then one could read past events from storage and then start listening to new events coming from the stream. But I'm not happy with this solution: consumers are not able to use Kinesis' checkpoints (the analogue of Kafka's consumer offsets). Plus, the Java KCL does not support AFTER_SEQUENCE_NUMBER yet, which would be useful in such an implementation.
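For reference, the persistence side of that setup can be quite small. A sketch of such an archiving Lambda, assuming a Kinesis event source and a hypothetical bucket name in an EVENT_ARCHIVE_BUCKET environment variable:

```python
import base64
import os
import time
import uuid

import boto3

s3 = boto3.client("s3")

# Hypothetical archive bucket, injected through the Lambda environment.
BUCKET = os.environ["EVENT_ARCHIVE_BUCKET"]


def handler(event, context):
    """Persist a batch of Kinesis records to S3 as newline-delimited JSON."""
    body = b"\n".join(
        base64.b64decode(record["kinesis"]["data"]) for record in event["Records"]
    )
    # Key by arrival time so a replay job can read batches back in rough order.
    key = "events/%d-%s.jsonl" % (int(time.time()), uuid.uuid4())
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
```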
First question: yes, I use Kinesis Streams when I need to process the received log/event data before storing it in S3. When I don't, I use Kinesis Firehose.
Second question: Kinesis Streams can store data for up to seven days. This is not forever, but it should be enough time to process your events. Depending on the value of the events being processed ....
If I do not need to process the event stream before storing in S3, then I use Kinesis Firehose writing to S3. Now I do not have to worry about event failures, persistence, etc. I then process the data stored in S3 with the best tool. I use Amazon Athena often and Amazon Redshift too.
You don't mention how much data you are processing or how it is being processed. If it is large, multiple MB / sec or higher, then I would definitely use Kinesis Firehose. You have to manage performance with Kinesis Streams.
One issue that I have with Kinesis Streams is that I don't like the client libraries, so I prefer to write everything myself. Kinesis Firehose reduces the coding for custom applications, as you just store the data in S3 and process it afterwards.
I like to think of S3 as my big data lake. I prefer to throw everything into S3 without preprocessing and then use various tools to pull out the data that I need. By doing this I remove lots of points of failure that need to be managed.

Multiple AWS Lambda functions on a Single DynamoDB Stream

I have a Lambda function to which multiple DynamoDB streams are configured as event sources, and this is part of a bigger pipeline. While doing my checks, I found some missing data in one of the downstream components. I want to write a simpler Lambda function configured as an event source on one of the earlier-mentioned DynamoDB streams. This would cause one of my DynamoDB streams to have two Lambda functions reading from it. I was wondering if this is OK? Are both Lambda functions guaranteed to receive all records placed in the stream, and are there any resource (read/write throughput) limits I need to be aware of? I couldn't find any relevant documentation for this on the AWS website, but I did find this regarding the processing of shards:
To access a stream and process the stream records within, you must do the following:
Determine the unique Amazon Resource Name (ARN) of the stream that you want to access.
Determine which shard(s) in the stream contain the stream records that you are interested in.
Access the shard(s) and retrieve the stream records that you want.
Note: No more than 2 processes at most should be reading from the same Streams shard at the same time. Having more than 2 readers per shard may result in throttling.
Not sure how the above relates to cases where Streams are configured as Event sources to Lambdas as opposed to manually reading from a Stream using the API.
You can have multiple Lambdas using the same stream as an event source. They will not interfere with each other. But as the documentation says: "Note No more than 2 processes at most should be reading from the same Streams shard at the same time. Having more than 2 readers per shard may result in throttling."
So if you heavily utilize your streams, you should not have more than two Lambdas connected to them.
This AWS blog post, https://aws.amazon.com/de/blogs/database/how-to-perform-ordered-data-replication-between-applications-by-using-amazon-dynamodb-streams/, suggests that you attach only one Lambda to the DDB stream and use a fan-out pattern for parallel processing. This also helps you process the DDB items in order.
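A minimal sketch of that fan-out pattern: a single Lambda on the stream republishes each record to an SNS topic (the topic ARN here is a placeholder), and any number of queues or functions subscribe to the topic instead of the stream:

```python
import json
import os

import boto3

sns = boto3.client("sns")

# Placeholder fan-out topic, injected through the Lambda environment.
TOPIC_ARN = os.environ["FANOUT_TOPIC_ARN"]


def handler(event, context):
    """Republish DynamoDB stream records to SNS.

    Keeping a single Lambda on the stream stays within the two-readers-
    per-shard guidance, while SNS lets any number of downstream consumers
    subscribe without touching the stream itself.
    """
    for record in event["Records"]:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Message=json.dumps(record["dynamodb"], default=str),
        )
```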