Multiple AWS Lambda functions on a Single DynamoDB Stream - amazon-web-services

I have a Lambda function to which multiple DynamoDB streams are configured as event sources and this is a part of a bigger pipeline. While doing my checks, I found some missing data in one of the downstream components. I want to write a simpler Lambda function which is configured as an event source to one of the earlier mentioned DynamoDB streams. This would cause one of my DynamoDB streams to have two Lambda functions reading from it. I was wondering if this is OK? Are both Lamdba functions guaranteed to receive all records placed in the stream and are there any resource (Read/Write throughput) limits I need to be aware of. Couldn't find any relevant documentation for this on the AWS website, but I did find this regarding processing of shards
To access a stream and process the stream records within, you must do
the following:
Determine the unique Amazon Resource Name (ARN) of the stream that you want to access.
Determine which shard(s) in the stream contain the stream records that you are interested in.
Access the shard(s) and retrieve the stream records that you want.
Note No more than 2 processes at most should be reading from the same
Streams shard at the same time. Having more than 2 readers per shard
may result in throttling.
Not sure how the above relates to cases where Streams are configured as Event sources to Lambdas as opposed to manually reading from a Stream using the API.

You can have multiple Lambdas using the same stream as an event source. They will not interfere with each other. But as the documentation says: "Note No more than 2 processes at most should be reading from the same Streams shard at the same time. Having more than 2 readers per shard may result in throttling."
So if you heavily utilize your streams you should not have more than two Lambdas connected to them.

This AWS Blog post https://aws.amazon.com/de/blogs/database/how-to-perform-ordered-data-replication-between-applications-by-using-amazon-dynamodb-streams/ suggest that you attach only one Lambda to the DDB stream and use a fan out pattern for parallel processing. This will help you processing the DDB items in order.

Related

Multi destinations for Kinesis stream at same time

Can we have multiple destinations from single Kinesis Streams?
I am getting output in Splunk but now I also want to add an S3 bucket as the destination.
If I add another Amazon Kinesis Data Firehose, will it affect the performance of Splunk reading? Splunk pulls directly from Kinesis. If I add another destination will it affect Will it affects our current read and writes?
One of the benefits of using Kinesis is that you can do exactly this behaviour.
Each consumer application becomes responsible for which events it has read from the shard. There is no concept of an entry being processed already between 2 seperate applications.
One recommendation from AWS to bare in mind for high throughput for multiple consumers is to use enhanced fanout.
Each consumer registered to use enhanced fan-out receives its own read throughput per shard, up to 2 MB/sec, independently of other consumers.

Is it possible to use enhanced fan-out with DynamoDB Streams?

When using Kinesis Data Streams, it's possible to use the enhanced fan-out feature?
DynamoDB Streams API says:
The DynamoDB Streams API is intentionally similar to that of Kinesis Streams, a service for real-time processing of streaming data at massive scale. In both services, data streams are composed of shards, which are containers for stream records. Both services' APIs contain ListStreams, DescribeStream, GetShards, and GetShardIterator actions. (Even though these DynamoDB Streams actions are similar to their counterparts in Kinesis Streams, they are not 100% identical.)
Is it possible to use enhanced fan-out with DynamoDB Streams? Is there any documentation or code samples that talks about this use?
It looks like the answer might be no because enhanced fan-out requires the SubscribeToShard method which wasn't listed above.
Yes, DynamoDb streams does not support the fancy enhanced fan-out. Think of Kinesis Data Streams = dynamoDb streams 2.0. So DynamoDb streams wont be changing anytime soon.
DynamoDb streams was designed primarily as oplog to notify a "single" consumer (two at most). With the question in mind, I assume you want to have multiple consumers with dedicated capacities connecting to the DynamoDb stream?
I would highly recommend you use a lambda to forward your data to Kinesis which would give you all these capabilities.
However if really need to stick with "only" DynamoDb streams.. then depending on the number of consumers, you will need to have additional code that can handle the throttling errors gracefully or/and reduce the dynamodb stream polling period.

Kinesis Stream and Kinesis Firehose Updating Elasticsearch Indexes

We want to use kinesis stream and firehose to update an aws managed elasticsearch cluster. We have hundreds of different indexes (corresponding to our DB shards) that need to be updated. When creating the firehose it requires that I specify the specific index name I want updated. Does that mean I need to create a separate firehose for each index in my cluster? Or is there a way to configure the firehose so it knows what index to used based on the content of the data.
Also, we would have 20 or so separate producers that would send data to a kinesis stream (each one of these producers would generate data for 10 different indexes). Would I also need a separate kinesis stream for each producer.
Summary:
20 producers (EC2 instances) -> Each producer sends data for 20 different indexes to a kinesis stream -> The kinesis stream then uses a firehose to update a single cluster which has 200 indexes in it.
Note: all of the indexes have the same mapping and name temple i.e. index_1, index_2...index_200
Edit: As we reindex the data we create new indexes along the lines of index_1-v2. Obviously we won't want to create a new firehose for each index version as they're being created. The new index name can be included in the JSON that's sent to the kinesis stream.
As you guessed, Firehose is the wrong solution for this problem, at least as stated. It is designed for situations where there's a 1:1 correspondence between stream (not producer!) and index. Things like clickstream data or log aggregation.
For any solution, you'll need to provide a mechanism to identify which index a record belongs to. You could do this by creating a separate Kinesis stream per message type (in which case you could use Firehose), but this would mean that your producers have to decide which stream to write each message to. That may cause unwanted complexity in your producers, and may also increase your costs unacceptably.
So, assuming that you want a single stream for all messages, you need a consumer application and some way to group those messages. You could include a message type (/ index name) in the record itself, or use the partition key for that purpose. The partition key makes for a somewhat easier implementation, as it guarantees that records for the same index will be stored on the same shard, but it means that your producers may be throttled.
For the consumer, you could use an always-on application that runs on EC2, or have the stream invoke a Lambda function.
Using Lambda is nice if you're using partition key to identify the message type, because each invocation only looks at a single shard (you may still have multiple partition keys in the invocation). On the downside, Lambda will poll the stream once per second, which may result in throttling if you have multiple stream consumers (with a stand-alone app you can control how often it polls the stream).

Kinesis Analytics Destination Guidance: Lambda vs Kinesis Stream to Lambda

After Kinesis Analytics does it's job, the next step is to send that information off to a destination. AWS currently offers 3 destination choices:
Kinesis stream
Kinesis Firehose delivery stream
AWS Lambda function
For my use case, Kinesis Firehose delivery stream is not what I want so I am left with:
Kinesis stream
AWS Lambda function
If I set the destination to a Kinesis Stream, I would then attach a Lambda to that stream to process the records.
AWS also offers the ability to set the destination to a Lambda, bypassing the Kinesis Stream step of this process. In doing some digging for docs I found this:
Using a Lambda Function as Output
Specifically in those docs under Lambda Output Invocation Frequency it says:
If records are emitted to the destination in-application stream within the data analytics application as a continuous query or a sliding window, the AWS Lambda destination function is invoked approximately once per second.
My Kinesis Analytics output qualifies under this scenario. So I can assume that my Lambda will be invoked, "approximately once per second".
I'm trying to understand the difference between using these 2 destinations as it pertains to using a Lambda.
Using AWS Lambda with Kinesis states that:
You can subscribe Lambda functions to automatically read batches of records off your Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.
So it sounds like the the invocation interval is the same in either case; approximately 1 second.
So I think the guidence is:
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If however, you need to use multiple different consumers to do different things for the same data sent to the destination, the a Kinesis Stream is more appropriate.
Is this a correct assumption on how to choose a destination? Again, for my use case I am excluding the Kinesis Firehose delivery stream.
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If however, you need to use multiple different consumers to do different things for the same data sent to the destination, the a Kinesis Stream is more appropriate.
• I would always use Kinesis Stream with one shard and batch size = 1 (for example) if I wanted the items to be consumed one by one with no concurrency.
If there are multiple consumers, increase the number of shards, one lambda is launched in parallel for each shard when there are items to process. If it makes sense, also increase the batch size.
But read again at the highlighted phrase below:
If however, you need to use multiple different consumers to do different things for the same data sent to the destination, the a Kinesis Stream is more appropriate.
If you have one or more producers and many consumers of the exactly same item, I guess you need to use SNS. The producer writes the item on one topic, then all the lambdas listening to the topic will process that item.
If this does not answer your question, please clarify it. There is a little ambiguity.

Partition Kinesis firehose S3 records by event time

Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
The only option with you is to add a post-processing layer after Kinesis Firehose. For e.g., you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in last hour and publishes them to correct S3 destinations.
It's not an answer for the question, however I would like to explain a little bit the idea behind storing records in accordance with event arrival time.
First a few words about streams. Kinesis is just a stream of data. And it has a concept of consuming. One can reliable consume a stream only by reading it sequentially. And there is also an idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number which identifies a position in the stream. Via specifying this number, one can start reading the stream from the certain event.
And now go back to default s3 firehose setup... Since the capacity of kinesis stream is quite limited, most probably one needs to store somewhere the data from kinesis to analyze it later. And the firehose to s3 setup does this right out of the box. It just stores raw data from the stream to s3 buckets. But logically this data is the still the same stream of records. And to be able to reliable consume (read) this stream one needs these sequential numbers for checkpoints. And these numbers are records arrival times.
What if I want to read records by creation time? Looks like the proper way to accomplish this task is to read the s3 stream sequentially, dump it to some [time series] database or data warehouse and do creation-time-based readings against this storage. Otherwise there will be always a non-zero chance to miss some bunches of events while reading the s3 (stream). So I would not suggest the reordering of s3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by an arrival-time.
Then, you define another Athena table, say, events_by_event_time which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them to events_by_event_time and now your events are partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a lambda function which takes your records, processes them, changes the partition key and then sends them back to firehose to be added. You would also have the change the firehose to enable this partitioning and also define your custom partition key/prefix/suffix.