What happens to the events in a DynamoDB Stream once they are received by AWS Lambda?

I have a DynamoDB table, it is linked with one stream, and that stream is linked with one Lambda function which processes it.
Question 1 - With the above setup, if an event arrives on the stream and is ingested by Lambda, does that event still reside in the stream, or does it get popped off as soon as it is ingested by Lambda, just like a queue?
Question 2 - Can someone kindly explain the inner workings of DynamoDB Streams and how they pass data to Lambda? For example, are there any states for the stream events?
P.S.: The AWS documentation says that events stay in the stream for a 24-hour window.

There are two concepts to understand here
Streams
Triggers
Whenever there is a change to the table, such as an addition, update, or deletion, the DynamoDB Streams feature (which behaves much like a Kinesis stream) stores that change for a period of 24 hours. It can record the change in one of four ways:
Keys only: only the key attributes of the changed item are stored.
New image: the entire item as it appears after the change is stored.
Old image: the entire item as it appeared before the change is stored.
New and old images: both the new and old versions of the item are stored.
To associate a Lambda function with your stream, a feature called a trigger is used. A change invokes the trigger, which in turn invokes the Lambda function associated with the change.
Part 1 of your question:
By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream has only one record in it, Lambda sends only that record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to five minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch or the batch window expires. If the Lambda function fails, it will retry that batch until it succeeds or the records expire, keeping other records on the shard from being processed as a result. To avoid stalled shards, you can configure the event source mapping to retry with a smaller batch size, limit the number of retries, or discard records that are too old (you can set the maximum age of a record that Lambda will read).
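The batching and retry behaviour above is configured on the event source mapping itself. A minimal sketch using the AWS SDK for JavaScript (the ARN, function name, and chosen values are placeholders, not anything from the question):
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// Hypothetical ARN and function name, for illustration only.
lambda.createEventSourceMapping({
    EventSourceArn: 'arn:aws:dynamodb:us-east-1:123456789012:table/MyTable/stream/2020-01-01T00:00:00.000',
    FunctionName: 'my-stream-processor',
    StartingPosition: 'TRIM_HORIZON',          // or LATEST
    BatchSize: 100,                            // records per invocation
    MaximumBatchingWindowInSeconds: 60,        // buffer up to one minute before invoking
    MaximumRetryAttempts: 5,                   // stop retrying a failing batch after 5 attempts
    MaximumRecordAgeInSeconds: 3600,           // discard records older than one hour
    BisectBatchOnFunctionError: true           // split a failing batch to isolate the bad record
}, (err, mapping) => {
    if (err) console.error(err);
    else console.log('Created mapping', mapping.UUID);
});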
Part 2 of your question:
The streams we are talking about here work like Kinesis streams: a feature designed to be used by multiple producers and consumers. Here the producer is DynamoDB and the consumer is Lambda. Consumers have dedicated read throughput, so they don't have to compete with other consumers of the same data. With dedicated consumers, Kinesis pushes records to Lambda over an HTTP/2 connection, which can also reduce the latency between adding a record and the function invocation.
The capacity of a stream is determined by the number of shards it contains. Shards are the units of capacity in a stream, so the more shards, the higher the capacity.
I have covered the inner workings in Part 1 of this answer. Feel free to ask follow-up questions.

Related

Kinesis Stream Consumption: LATEST v/s TRIM_HORIZON

I have a use case where I need to keep the Kinesis trigger on my consumer (let's call it Lambda B) disabled while the producer (let's call it Lambda A) writes to the Kinesis stream. Once the write is complete, I intend to enable the trigger, and Lambda B should be able to process the data present in the Kinesis stream. With this situation in mind:
LATEST - This implies that records written to the stream after enabling the Lambda trigger will be processed. Records written during the disabled phase will not be processed. Doesn't suit my use case. Discarded.
TRIM_HORIZON - All the records in the stream will be processed. Okay, this works for my use case. BUT I'm imagining a case where the trigger goes enabled(1) -> disabled -> enabled(2). After enabled(2), Lambda B will read records put in during the disabled state. That is fine. But will it also read records that were already read during the enabled(1) phase (since Kinesis retains records for 24 hours)? If so then this is an issue.
AT_TIMESTAMP - This requires manually putting in timestamps which I do not want to do. Discarded.
You only need to pick the initial starting point when you create the event source mapping. Once the mapping is created, and has processed messages, it will remember the last messages that it processed.
So a sequence like this will give you what you want:
Create Lambda B and event source mapping with TRIM_HORIZON (or LATEST; there aren't any messages in the stream, so they're functionally equivalent).
Disable event source.
Start Lambda A to write messages to stream.
Enable event source. Lambda B will process the messages in the stream.
Repeat steps 2-4 as needed. (A scripted sketch of the disable/enable cycle follows below.)
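A rough sketch of steps 2 and 4 using the AWS SDK for JavaScript, assuming the mapping already exists; the function name and stream ARN are placeholders:
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// Look up the mapping between Lambda B and the stream (placeholder names).
async function setTriggerEnabled(enabled) {
    const { EventSourceMappings } = await lambda.listEventSourceMappings({
        FunctionName: 'lambda-b',
        EventSourceArn: 'arn:aws:kinesis:us-east-1:123456789012:stream/my-stream'
    }).promise();

    // Flip the Enabled flag; the checkpoint (last processed position) is preserved.
    await lambda.updateEventSourceMapping({
        UUID: EventSourceMappings[0].UUID,
        Enabled: enabled
    }).promise();
}

// Call setTriggerEnabled(false) before Lambda A writes, and setTriggerEnabled(true) afterwards.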

DynamoDB stream Lambda: how to block a partition-key stream indefinitely

I’m using a DynamoDB stream Lambda to maintain a kind of business sequencing logic.
In case of failure, I can only block processing of the affected records for up to one day. Once they have expired, Lambda will continue to process the next non-expired ones, and my sequencing logic will be corrupted.
My understanding is that a DynamoDB stream Lambda ensures in-order processing at the table partition-key level.
It would be great if I could implement partial, indefinite blocking for a specific sub-stream (i.e. a single table partition-key stream).
I’m thinking of using:
MaximumRecordAgeInSeconds to limit the retry logic (up to a few hours to allow hotfixes)
an SQS FIFO queue as a destination for discarded records
a second DynamoDB table for locking purposes (pay-per-request billing mode)
The Lambda handler has to be aware of MaximumRecordAgeInSeconds and set the lock just before the batch of records is discarded in case of error.
The lock will force subsequent records of the affected partition-key sub-stream to fail.
Finally, an SQS consumer will deal with the blocking issue, process the discarded records from the queue, and then unlock the sub-stream.
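A very rough sketch of that locking logic inside the stream handler; the table name, partition-key attribute, and the processRecord / recordIsAboutToExpire helpers are all hypothetical placeholders, not part of any real implementation:
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
    for (const record of event.Records) {
        const pk = record.dynamodb.Keys.pk.S;   // hypothetical partition-key attribute named "pk"

        // If this partition key is locked, fail the batch so the sub-stream stays blocked.
        const lock = await ddb.get({ TableName: 'stream-locks', Key: { pk } }).promise();
        if (lock.Item) {
            throw new Error('partition key ' + pk + ' is locked');
        }

        try {
            await processRecord(record);         // hypothetical business logic
        } catch (err) {
            // If the record is close to MaximumRecordAgeInSeconds, set the lock
            // before Lambda discards the batch to the on-failure destination.
            if (recordIsAboutToExpire(record)) { // hypothetical age check
                await ddb.put({ TableName: 'stream-locks', Item: { pk } }).promise();
            }
            throw err;
        }
    }
};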
The locking logic looks hacky and still has some pitfalls.
Can I do better?

What happens when a lambda dies?

I am new to AWS so I am not sure what the behavior is when the following situation occurs.
Let's say I have a Kinesis stream with JSON data (and let's say every couple of minutes a few thousand messages get inserted).
Now there is a Lambda function that gets invoked every time a new message is inserted into Kinesis, which reads the message and does some processing before inserting into Redshift.
So what happens if there is some error and the Lambda function crashes while doing the processing and takes a few minutes, or even a couple of hours (I don't know if that's even possible), to come back up? Will it continue reading Kinesis from the last unread message, or will it read from the latest inserted messages (as that is the invoking event)?
Thanks in advance.
Lambda function crashes while doing the processing
This is possible.
and takes a few minutes or even a couple of hours(i don't know if that's even possible) to come back up.
This is not exactly possible.
A Lambda function is only allowed to run until it returns a response, throws an error, or the timeout timer fires, whichever comes first. It would never be a couple of hours.
Lambda will create a new container every time the function is invoked, unless it already has one standing by for you or you are hitting a concurrency limit (typically 1000+).
However... for Kinesis streams, what happens is a bit different because of the need for in-order processing.
Poll-based (or pull model) event sources that are stream-based: These consist of Kinesis Data Streams or DynamoDB. When a Lambda function invocation fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days.
The exception is treated as blocking, and AWS Lambda will not read any new records from the shard until the failed batch of records either expires or is processed successfully. This ensures that AWS Lambda processes the stream events in order.
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
So your Lambda function throwing an exception or running past its timeout will simply cause the Lambda service to destroy the container, immediately create a new one, and then retry the invocation with the exact same data, again and again, until the data expires (as dictated by the Kinesis retention configuration).
The delay would typically be no longer than your timeout, or the time it takes for the exception to occur, plus some number of milliseconds (up to a few seconds, for a cold start). The timeout is individually configurable on your Lambda function itself, up to 15 minutes (but this max is probably much too long).
It's potentially important to remember a somewhat hidden detail here -- there is a system that is part of the Lambda service that is reading your Kinesis stream and then telling another part of the Lambda service to invoke your function, with the batch of records. The Lambda service (not your Lambda function) is checking the stream by pulling data -- the stream is not technically pushing data to Lambda. DynamoDB streams and SQS work similarly -- Lambda pulls data, and handles retries by re-invoking the function. The other service is not responsible for pushing data.
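One newer option worth knowing about: event source mappings for Kinesis and DynamoDB streams can be configured with the ReportBatchItemFailures function response type, so the handler can report exactly where in the batch processing failed and Lambda retries only from that record rather than re-sending the whole batch. A hedged sketch (processRecord is a placeholder for your own Redshift insert logic):
exports.handler = async (event) => {
    const batchItemFailures = [];

    for (const record of event.Records) {
        try {
            await processRecord(record);   // placeholder for your processing logic
        } catch (err) {
            // Report this record as failed; Lambda will retry from here
            // instead of re-sending the whole batch.
            batchItemFailures.push({ itemIdentifier: record.kinesis.sequenceNumber });
            break;
        }
    }

    return { batchItemFailures };
};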

Amazon Kinesis & AWS Lambda Retries

I'm very new to Amazon Kinesis so maybe this is just a problem in my understanding but in the AWS Lambda FAQ it says:
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
My question is, what happens if for some reason some malformed data gets put onto a shard by a producer, and when the Lambda function picks it up it errors out and then just keeps retrying constantly? That would mean that the processing of that particular shard is blocked for 24 hours by the error.
Is the best practice to handle application errors like that by wrapping the problem in a custom error and sending this error downstream along with all the successfully processed records and let the consumer handle it? Of course, this still wouldn't help in the case of an unrecoverable error that crashed the program like a null pointer: again we'd be back to the blocking retry loop for the next 24 hours.
Don't overthink it, Kinesis is just a queue. You have to consume a record (i.e. pop it from the queue) successfully in order to proceed to the next one. Just like a FIFO queue.
The appropriate approach should be:
Get a record from the stream.
Process it in a try-catch-finally block.
If the record is processed successfully, no problem. <- TRY
But if it fails, note it down to another place to investigate the reason why it failed. <- CATCH
And at the end of your logic blocks, always persist the position to DynamoDB. <- FINALLY
If an internal error occurs in your system (memory error, hardware error, etc.), that is another story, as it may affect the processing of all of the records, not just one.
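A rough sketch of that loop as it might look in a Lambda consumer; the consumer-checkpoints table and the processRecord function are hypothetical placeholders for the "another place" and business logic described above:
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
    for (const record of event.Records) {
        const seq = record.kinesis.sequenceNumber;
        try {
            await processRecord(record);                       // <- TRY: your business logic
        } catch (err) {
            console.error('failed record', seq, err);          // <- CATCH: note it down to investigate later
        } finally {
            await ddb.put({                                    // <- FINALLY: persist the position
                TableName: 'consumer-checkpoints',             // hypothetical table
                Item: { shardId: record.eventID.split(':')[0], sequenceNumber: seq }
            }).promise();
        }
    }
};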
By the way, if processing a record takes more than a minute, it is obvious you are doing something wrong. Because Kinesis is designed to handle thousands of records per second, you should not have the luxury of such long jobs for each of them.
The question you are asking is a general problem with queue systems, sometimes called the "poison message" problem. You have to handle such messages in your business logic to be safe.
http://www.cogin.com/articles/SurvivingPoisonMessages.php#PoisonMessages
This is a common question about processing events in Kinesis, and I'll try to give you some points for building your Lambda function to handle such issues with "corrupted" data. Since it is best practice to have separate parts of your system writing to the Kinesis stream and other parts reading from it, it is common that you will have such problems.
First, why do you have such problematic events?
Using Kinesis to process your events is a good way to break a complex system that is doing both front-end processing (serving end users) and back-end processing (analyzing events) in the same code into two independent parts. The front-end people can focus on their business, while the back-end people don't need to push code changes to the front-end if they want to add functionality to serve their analytic use cases. Kinesis is a buffer of events that both removes the need for synchronization and simplifies the business logic code.
Therefore, we would like events written to the stream to be flexible in their "schema", and if the front-end teams wish to change the event format, add fields, delete fields, change the protocol or the encryption keys, they should be able to do that as often as they want.
Now it is up to the teams that are reading from the stream to be able to process such flexible events in an efficient way, and not break their processing every time such a change happens. Therefore, you should expect that your Lambda function will see events that it can't process, and a "poison pill" is not as rare an event as you might expect.
Second, how do you handle such problematic events?
Your Lambda function will get a batch of events to process. Please note that you shouldn't get the events one by one, but in large batches of events. If your batches are too small, you will quickly get large lags on the stream.
For each batch you will iterate over the events, process them, and then checkpoint the last sequence ID of the batch in DynamoDB. Lambda does most of these steps automatically for you (see more here: http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser-create-test-function.html):
console.log('Loading function');

exports.handler = function(event, context) {
    console.log(JSON.stringify(event, null, 2));
    event.Records.forEach(function(record) {
        // Kinesis data is base64 encoded, so decode it here
        const payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
        console.log('Decoded payload:', payload);
    });
    context.succeed();
};
This is what happens on the "happy path", when all the events are processed without any problem. But if you encounter a problem in the batch and you don't "commit" the events with a success notification, the batch will fail and you will get all the events in the batch again.
Now you need to decide what the reason for the failure in processing is.
Temporary problem (throttling, network issue...) - it is OK to wait a second and try again a couple of times. In many cases the issue will resolve itself.
Occasional problem (out of memory...) - it is best to increase the memory allocation of the Lambda function or decrease the batch size. In many cases such a modification will resolve the issue.
Constant failure - it means that you have to either ignore the problematic event (put it in a DLQ, a dead-letter queue) or modify your code to handle it.
The challenge is to identify the type of failure in your code and handle each type differently. You need to write your Lambda code in a way that identifies the failure (the type of exception, for example) and reacts accordingly.
You can use the CloudWatch integration to write such failures to the logs and create the relevant alarms. You can also use CloudWatch Logs as a makeshift "dead-letter queue" to see what the source of the problem is.
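A hedged sketch of that decision logic inside the handler; the isTransient check and processRecord are placeholders you would implement against your own exception types:
exports.handler = async (event) => {
    for (const record of event.Records) {
        try {
            await processRecord(record);                  // placeholder business logic
        } catch (err) {
            if (isTransient(err)) {
                // Temporary/occasional problem: rethrow so Lambda retries the batch.
                throw err;
            }
            // Constant failure: log it as a makeshift dead-letter queue and move on.
            console.error('DLQ', record.kinesis.sequenceNumber, err.message);
        }
    }
};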
In your Lambda you can either throw an error, thereby sending back the whole batch for retry, or you can refrain from throwing an error and instead push the failing messages to an SQS queue to handle them differently. SQS has a retention period of up to 14 days. You could also keep checkpoints with each record, so you know whether a record was processed in a previous run.
If you have a lot of incoming data and you don't want to introduce any latency, you could just ignore the error and move on while adding those events to an SQS queue.
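A minimal sketch of that pattern, assuming an existing SQS queue; the queue URL and processRecord are placeholders:
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async (event) => {
    for (const record of event.Records) {
        try {
            await processRecord(record);   // placeholder business logic
        } catch (err) {
            // Don't throw; park the raw record in SQS (retained up to 14 days) instead.
            await sqs.sendMessage({
                QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/failed-records',
                MessageBody: JSON.stringify({ data: record.kinesis.data, error: err.message })
            }).promise();
        }
    }
};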

Status of kinesis stream reader

How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream; however, I don't know how far along in the data the reader currently is (I don't know the latest sequence number in the shard).
I was thinking of getting a LATEST iterator for each shard and reading the last record's sequence number; however, that doesn't work if no new data has arrived since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could include a send timestamp in each Kinesis message and, when processing the message, record the time difference as an AWS CloudWatch custom metric. This would indicate how close your consumer is to the front of the stream.
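A sketch of publishing such a lag metric from the consumer, assuming each message carries a send timestamp; the namespace, metric name, and sentAt field are made up for illustration:
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// Call this for each processed message; `message` is the decoded Kinesis payload.
async function reportLag(message) {
    const lagMs = Date.now() - new Date(message.sentAt).getTime();  // hypothetical sentAt field
    await cloudwatch.putMetricData({
        Namespace: 'MyApp/KinesisConsumer',
        MetricData: [{
            MetricName: 'ConsumerLagMilliseconds',
            Value: lagMs,
            Unit: 'Milliseconds'
        }]
    }).promise();
}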
You could also record the number of messages pushed (at the pushing application) and the number of messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you should see the curves roughly follow each other, indicating that the consumer is doing a good job of keeping up with the workload.
You could also try monitoring your Kinesis consumer to see how often it idly waits for records (i.e., no results are returned by Kinesis, suggesting it is at the front of the stream and all records have been processed).
Also note there is not a way to track a "percent" processed in the stream, since Kinesis messages expire after 24 hours (so the total number of messages is constantly rolling). There is also not a direct (API) function to count the number of messages inside your stream (unless you have recorded this as above).
If you use the KCL, you can do that by comparing IncomingRecords, from the built-in CloudWatch metrics for Kinesis, with RecordsProcessed, which is a custom metric published by the KCL.
Then you select a time range and an interval of, say, one day.
You would then get a graph of the two metrics plotted together; in my case it showed many more records added than processed. By looking at the values at each point you will know exactly whether your processor is falling behind or not.
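If you want the same comparison programmatically rather than in the console, here is a sketch using GetMetricData; the stream name and the KCL application namespace are placeholders, and the exact dimensions depend on your KCL metrics configuration:
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

cloudwatch.getMetricData({
    StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000),   // last day
    EndTime: new Date(),
    MetricDataQueries: [
        {
            Id: 'incoming',
            MetricStat: {
                Metric: {
                    Namespace: 'AWS/Kinesis',
                    MetricName: 'IncomingRecords',
                    Dimensions: [{ Name: 'StreamName', Value: 'my-stream' }]   // placeholder stream
                },
                Period: 3600,
                Stat: 'Sum'
            }
        },
        {
            Id: 'processed',
            MetricStat: {
                Metric: {
                    Namespace: 'MyKclApplication',            // KCL publishes under the application name
                    MetricName: 'RecordsProcessed',
                    Dimensions: [{ Name: 'Operation', Value: 'ProcessTask' }]  // depends on metrics level
                },
                Period: 3600,
                Stat: 'Sum'
            }
        }
    ]
}, (err, data) => {
    if (err) console.error(err);
    else console.log(JSON.stringify(data.MetricDataResults, null, 2));
});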