What happens when a lambda dies? - amazon-web-services

I am new to AWS so I am not sure what the behavior is when the following situation occurs.
Let's say I have a Kinesis stream with JSON data (and let's say every couple of minutes a few thousand messages get inserted).
Now there is a Lambda function that gets invoked every time a new message is inserted into the Kinesis stream; it reads the message and does some processing before inserting into Redshift.
So what happens if there is some error and the Lambda function crashes while doing the processing, and takes a few minutes or even a couple of hours (I don't know if that's even possible) to come back up? Will it continue reading the Kinesis stream from the last unread message, or will it read from the latest inserted messages (as that is the invoking event)?
Thanks in advance.

Lambda function crashes while doing the processing
This is possible.
and takes a few minutes or even a couple of hours (I don't know if that's even possible) to come back up.
This is not exactly possible.
A Lambda function is only allowed to run until it returns a response, throws an error, or the timeout timer fires, whichever comes first. It would never be a couple of hours.
Lambda will create a new container every time the function is invoked, unless it already has one standing by for you or you are hitting a concurrency limit (typically 1000+).
However... for Kinesis streams, what happens is a bit different because of the need for in-order processing.
Poll-based (or pull model) event sources that are stream-based: These consist of Kinesis Data Streams or DynamoDB. When a Lambda function invocation fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days.
The exception is treated as blocking, and AWS Lambda will not read any new records from the shard until the failed batch of records either expires or is processed successfully. This ensures that AWS Lambda processes the stream events in order.
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
So your Lambda function throwing an exception or running past its timeout will simply cause the Lambda service to destroy the container, create a new one, and then retry the invocation with the exact same data, over and over, until the data expires (as dictated by your Kinesis retention configuration).
The delay would typically be no longer than your timeout, or the time it takes for the exception to occur, plus some number of milliseconds (up to a few seconds, for a cold start). The timeout is individually configurable on your Lambda function itself, up to 15 minutes (but this max is probably much too long).
It's potentially important to remember a somewhat hidden detail here -- there is a system that is part of the Lambda service that is reading your Kinesis stream and then telling another part of the Lambda service to invoke your function, with the batch of records. The Lambda service (not your Lambda function) is checking the stream by pulling data -- the stream is not technically pushing data to Lambda. DynamoDB streams and SQS work similarly -- Lambda pulls data, and handles retries by re-invoking the function. The other service is not responsible for pushing data.
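Note that this blocking retry behaviour is now tunable on the event source mapping itself (maximum retry attempts, maximum record age, and batch bisection), which postdates the documentation quoted above. A minimal sketch of adjusting it with the Node.js aws-sdk v2 client; the UUID is a placeholder for your mapping's ID (listEventSourceMappings returns it):

// Sketch: cap the retry-until-expiry behaviour of a Kinesis event source mapping.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.updateEventSourceMapping({
  UUID: 'your-event-source-mapping-uuid', // placeholder
  BisectBatchOnFunctionError: true,       // split a failing batch to isolate the bad record
  MaximumRetryAttempts: 3,                // stop retrying after 3 attempts instead of until expiry
  MaximumRecordAgeInSeconds: 3600         // give up on records older than an hour
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Mapping state:', data.State);
});

With these set, a poison batch delays its shard for a bounded time rather than until the data expires.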

Related

SQS → Lambda Problem With maximumBatchingWindow

Our intention is to trigger a lambda when messages are received in an SQS queue.
We only want one invocation of the lambda to run at a time (maximum concurrency of one).
We would like for the lambda to be triggered every time one of the following is true:
There are 10,000 messages in the queue
Five minutes has passed since the last invocation of the lambda
Our consumer lambda is dealing with an API with limited API calls and strict concurrency limits. The above solution ensures we never encounter concurrency issues and we can batch our calls together, ensuring we never consume too many API calls.
Here is our serverless.yml configuration:
functions:
  sqs-consumer:
    name: sqs-consumer
    handler: handlers.consume_handler
    reservedConcurrency: 1 # maximum concurrency of 1
    events:
      - sqs:
          arn: !GetAtt
            - SqsQueue
            - Arn
          batchSize: 10000
          maximumBatchingWindow: 300
    timeout: 900
resources:
  Resources:
    SqsQueue:
      Type: 'AWS::SQS::Queue'
      Properties:
        QueueName: sqs-queue
        VisibilityTimeout: 5400 # 6x greater than the lambda timeout
The above does not give us the desired behavior. We are seeing our lambda triggered every 1 to 3 minutes (instead of 5). It is indeed using batches, because we'll see multiple messages being processed in a single invocation, but even with just one or two messages in the queue at a time it doesn't wait 5 minutes to trigger the lambda.
Our messages are extremely small, so it's not possible we're coming anywhere close to the 6 MB limit.
We would expect the only time the lambda is triggered to be when either 10,000 messages have accumulated in the queue or five minutes have transpired since the previous invocation. Instead we are seeing the lambda invoked anywhere in between every 1 to 3 minutes with a batch size that never even breaks 100, much less 10,000.
The largest batch size I’ve seen it invoke the lambda with so far has been 28, and sometimes with only one message in the queue it’ll invoke the function when it’s only been one minute since the previous invocation.
We would like to avoid using Kinesis, as the volume we’re dealing with truly doesn’t warrant it.
Reply from AWS Support:
As per the Case ID 10802672001, I understand that you have an SQS event source mapping on Lambda with a batch size of 500 and a batch window of 60 seconds. I further understand that you have observed the lambda function invocation having fewer messages than 500 in a batch, and not waiting for the configured batch window time while receiving messages. You would like to know why lambda is being invoked prior to meeting any of the above configured conditions, and seek our assistance in troubleshooting the same. Please correct me if I misunderstood your query by any means.
Initially, I would like to thank you for sharing the detailed correspondence along with the screenshot of the logs; it was indeed very helpful in troubleshooting the issue.
Firstly, I used the internal tools to check the configuration of your lambda function "sd_dch_archivebatterydata" and observed that there is no throttling in the lambda function and no reserved concurrency configured. As you might already be aware, Lambda is meant to scale while polling from SQS queues, and thus it is recommended not to use reserved concurrency, as it goes against the design of the event source. On checking the log screenshot you shared, I observed there were no errors.
Regarding your queries, please allow me to answer them as follows:
Please understand here that batch size is the maximum number of messages that lambda will read from the queue in one batch for a single invocation. It should be considered as the maximum number of messages (up to) that can be received in a single batch, but not as a fixed value that can be received at all times in a single invocation.
-- Please see "When Lambda invokes the target function, the event can contain multiple items, up to a configurable maximum batch size" in the official documentation here [1] for more information on the same.
I would also like to add that, according to the internal architecture of how the SQS service is designed, Lambda pollers will poll the messages from the queue using "ReceiveMessage" API calls and invoke the Lambda function.
-- Please refer to the documentation [2], which states the following: "If the number of messages in the queue is small (fewer than 1,000), you most likely get fewer messages than you requested per ReceiveMessage call. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response. If this happens, repeat the request".
-- Thus, we can see that the number of messages that can be obtained in a single lambda invocation with a certain batch size depends on the number of messages in the SQS queue and the SQS service's internal implementation.
Also, the batch window is the maximum amount of time that the poller waits to gather messages from the queue before invoking the function. However, this only applies when there are no messages in the queue. Thus, as soon as there is a message in the queue, the Lambda function will be invoked without further delay, without waiting for the full batch window time specified. You can refer to the "WaitTimeSeconds" parameter in the "ReceiveMessage" API.
-- The batch window just ensures that lambda starts polling after a certain time, so that enough messages are present in the queue. However, there are other factors, like the size of messages, incoming volume, etc., that can affect this behavior.
Additionally, I would like to confirm that polling from SQS in Lambda uses the synchronous invocation type, which has an invocation payload size limit of 6 MB. Please refer to the following AWS documentation for more information on the same [3].
Having said that, I can confirm that this Lambda polling behaviour is by design and not a bug. Please rest assured that there are no issues with the Lambda and SQS services.
Our scenario is archiving to S3, and we want fewer, larger files. It looks like our options are potentially Kinesis, or running a custom receive application on something like ECS...
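If Kinesis or ECS feels heavy, one middle road is to drop the event source mapping entirely and drain the queue yourself from a Lambda on an EventBridge five-minute schedule, writing one larger S3 object per run. A rough sketch, assuming the aws-sdk v2 client; QUEUE_URL and ARCHIVE_BUCKET are placeholder environment variables:

// Sketch: scheduled consumer that drains the queue and writes one S3 object per run.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const s3 = new AWS.S3();

const QUEUE_URL = process.env.QUEUE_URL;   // placeholder
const BUCKET = process.env.ARCHIVE_BUCKET; // placeholder

exports.handler = async () => {
  const bodies = [];
  const entries = [];
  // ReceiveMessage returns at most 10 messages per call; cap the loop so we
  // stay well inside the function timeout.
  for (let i = 0; i < 1000; i++) {
    const resp = await sqs.receiveMessage({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 1
    }).promise();
    if (!resp.Messages || resp.Messages.length === 0) break;
    for (const m of resp.Messages) {
      bodies.push(m.Body);
      entries.push({ Id: m.MessageId, ReceiptHandle: m.ReceiptHandle });
    }
  }
  if (bodies.length === 0) return;
  // One larger object instead of thousands of tiny ones.
  await s3.putObject({
    Bucket: BUCKET,
    Key: `archive/${Date.now()}.json`,
    Body: JSON.stringify(bodies)
  }).promise();
  // DeleteMessageBatch accepts at most 10 entries per call.
  for (let i = 0; i < entries.length; i += 10) {
    await sqs.deleteMessageBatch({
      QueueUrl: QUEUE_URL,
      Entries: entries.slice(i, i + 10)
    }).promise();
  }
};

Messages are only deleted after the object is written, so a crash mid-run produces duplicates rather than losses; the queue's visibility timeout must exceed the function's runtime.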

SNS > AWS Lambda asynchronous invocation queue vs. SNS > SQS > Lambda

Background
This architecture relies solely on Lambda's asynchronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data that can vary drastically in size (tens of KB to potentially 1-3 MB). The data contains a JSON array holding one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subscribed to the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS-managed Lambda asynchronous queue begins to fill up as the instances of the parser max out at 3. The Lambda queueing mechanism initiates retries with incremental backoff when throttling occurs, until the invocation request can be processed by the parser function.
It is imperative that a record does not get lost during this process, as records cannot be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocations. Everything worked as expected. Lambda reported hundreds of throttle responses, but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaviour is effectively the same as what I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in AWS-managed Lambda asynchronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prolonged periods, so I want to understand whether the Lambda queue can handle this as sensibly as SQS would. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a definitive answer, but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/@zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail; I can't find it at the moment. The premise is the same as this: queues are leaky buckets.
So what if I add more leaky buckets? Well, you've delayed the leaking; however, it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
An MCVE is missing from this question, so I cannot address the precise problem you are having.
As for an opinion on whether to choose SQS or the Lambda queue, I'll point to the Meta discussion on this:
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR: It depends.
I think the biggest advantage of using your own queue is the fact that you, as a user, have visibility into the state of your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions, with the 'guarantee' that Lambda will retry over an interval. If using an SQS source queue instead, you have complete visibility into the state of your message processing at all times, with no ambiguity.
Secondly, regarding overpulling: in theory this is a concern, but in practice it has never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously, set your retry policy appropriately and use a DLQ, as transient/unpredictable errors CAN occur.
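On questions 1 and 2: the async queue's retry count and "message timeout" are configurable per function, and events that exhaust their retries or exceed the maximum age can be routed to an on-failure destination rather than silently dropped. A sketch using the aws-sdk v2 client; the function name and queue ARN are placeholders:

// Sketch: bound the async retry behaviour and route failures to SQS.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.putFunctionEventInvokeConfig({
  FunctionName: 'parser',         // placeholder
  MaximumEventAgeInSeconds: 3600, // events older than this go to OnFailure instead of retrying
  MaximumRetryAttempts: 2,        // applies to function errors; throttles retry up to the max age
  DestinationConfig: {
    OnFailure: { Destination: 'arn:aws:sqs:us-east-1:123456789012:parser-dlq' } // placeholder ARN
  }
}).promise().then(() => console.log('async invoke config applied'));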

What happens to the events in a DynamoDB Stream once they are received by AWS Lambda

I have a DynamoDB table that is linked with one stream, and that stream is linked with one lambda function which processes it.
Question - With the above setup, if an event comes into the stream and is ingested by Lambda, does that event still reside in the stream, or does it get POPPED out as soon as it is ingested by Lambda, just like a queue?
Question 2 - Can someone kindly tell me about the inner workings of a DDB Stream and how it passes data to Lambda? Are there any states for the stream events?
P.S.: AWS documentation says that events stay in the stream for a 24 hour window.
There are two concepts to understand here:
Streams
Triggers
Whenever there is a change in the table, like an addition, update or deletion, the DynamoDB Streams feature stores that change for a period of 24 hrs. It does this through 4 view types:
Keys only: only the keys of the changed item are stored
New image: the entire item, as it looks after the change, is stored
Old image: when a change is performed on an item, the old item is stored instead of the new one
New and old: self-explanatory
To associate a lambda function with your streams, a feature called Triggers is used. The changes invoke the Trigger, which in turn runs the lambda function associated with the change.
Part 1 of your question:
By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires. If the Lambda fails, it will try to process that message indefinitely (or until it expires), keeping other messages from being processed as a result. To avoid stalled shards (I'll talk about this later), you can configure the event source mapping to retry with a smaller batch size, limit the number of retries, or discard records that are too old (you can set the maximum age of a record that lambda will read).
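For reference, a minimal handler shape for a stream batch (Node.js); the field names follow the standard DynamoDB Streams record layout, assuming a "new and old images" view type:

// Sketch: DynamoDB Streams handler.
exports.handler = async (event) => {
  for (const record of event.Records) {
    console.log(record.eventName);             // INSERT | MODIFY | REMOVE
    const newImage = record.dynamodb.NewImage; // present for INSERT/MODIFY
    const oldImage = record.dynamodb.OldImage; // present for MODIFY/REMOVE
    console.log(JSON.stringify({ newImage, oldImage }));
  }
  // Throwing here would make Lambda retry the same batch and block the shard,
  // as described above.
};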
Part 2 of your question:
The streams we are talking about here behave like Kinesis Streams: a feature meant to be used by multiple producers and consumers. Here the producer is DynamoDB and the consumer is Lambda. Consumers have dedicated read throughput, so they don't have to compete with other consumers of the same data. With consumers, Kinesis pushes records to Lambda over an HTTP/2 connection, which can also reduce latency between adding a record and function invocation.
The capacity of a stream is determined by the number of shards it contains. Shards are small units of capacity in the stream; hence, the higher the shard count, the higher the capacity.
I guess I have explained the working in part 1 of this answer. Feel free to ask follow-up questions.

Tracking a message on multiple AWS SQS queues

I have around 3 AWS Lambda functions taking the following form:
Lambda function 1: Reads from an SQS queue and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 2: Reads the message from Lambda function 1, and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 3: Reads the message from Lambda function 2, and updates storage.
There are 3 queues involved, and the message format (structure) in each queue is different; however, they share one uniqueId which is the same across queues and can be used to relate them to each other. So my question is: is there any way, in SQS or some other tool, to track the messages? What I'm specifically looking for is stuff like:
Time the message was entered into the queue
Time the message was taken by the Lambda function for processing
My problem is that the 3 Lambda functions individually perform within a couple of milliseconds, but the time taken for end-to-end execution is way too long; I suspect that the messages are spending too long in transit.
I'm open to any other ideas on optimisation.
AWS Step Functions is specifically designed for passing information between Lambda functions and orchestrating the whole process.
However, you would need to change the way you have written your functions to take advantage of Step Functions.
Since your only real desire is to explore why it takes "way too long", then AWS X-Ray should be a good way to gather this information. It can track a single transaction end-to-end through each process. I think it's just a matter of including a library and activating X-Ray.
See: AWS Lambda and AWS X-Ray
Or, just start doing some manual investigations in the log files. The log shows how long each function takes to run, so you should be able to identify whether the time taken is within a particular Lambda function, or whether the time is spent between functions, waiting for them to trigger.
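If the queues are wired to the Lambdas through event source mappings, the time-in-queue part needs no extra tooling at all: every record in the SQS event carries a SentTimestamp attribute, so a log line per record gives you the transit time for each hop. A sketch:

// Sketch: log how long each SQS message waited before this invocation.
exports.handler = async (event) => {
  const now = Date.now();
  for (const record of event.Records) {
    const sentAt = Number(record.attributes.SentTimestamp); // epoch millis, set by SQS
    console.log(`message ${record.messageId} spent ${now - sentAt} ms in the queue`);
  }
};

Comparing these numbers across the three functions should quickly show whether the latency really is in transit or inside the functions.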

Amazon Kinesis & AWS Lambda Retries

I'm very new to Amazon Kinesis, so maybe this is just a problem in my understanding, but the AWS Lambda FAQ says:
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
My question is, what happens if for some reason some malformed data gets put onto a shard by a producer and when the Lambda function picks it up it errors out and then just keeps retrying constantly? This then means that the processing of that particular shard would be blocked for 24 hours by the error.
Is the best practice to handle application errors like that by wrapping the problem in a custom error and sending this error downstream along with all the successfully processed records and let the consumer handle it? Of course, this still wouldn't help in the case of an unrecoverable error that crashed the program like a null pointer: again we'd be back to the blocking retry loop for the next 24 hours.
Don't overthink it; Kinesis is just a queue. You have to consume a record (i.e. pop it from the queue) successfully in order to proceed to the next one. Just like a FIFO queue.
The appropriate approach should be:
Get a record from the stream.
Process it in a try-catch-finally block.
If the record is processed successfully, no problem. <- TRY
But if it fails, note it down somewhere else to investigate the reason why it failed. <- CATCH
And at the end of your logic blocks, always persist the position to DynamoDB. <- FINALLY
If an internal error occurs in your system (memory error, hardware error, etc.) that is another story; it may affect the processing of all of the records, not just one.
By the way, if processing of a record takes more than 1 minute, it is obvious you are doing something wrong. Because Kinesis is designed to handle thousands of records per second, you should not have the luxury of processing such long jobs for each of them.
The question you are asking is a general problem of queue systems, sometimes called "poisonous message". You have to handle them in your business logic to be safe.
http://www.cogin.com/articles/SurvivingPoisonMessages.php#PoisonMessages
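For illustration, here is that try-catch-finally shape as a sketch for a hand-rolled consumer (the table name and helper functions are hypothetical; with Lambda's Kinesis event source mapping the checkpointing is handled for you):

// Sketch: per-record processing with an explicit checkpoint, as described above.
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

async function processRecord(record) { /* your business logic (hypothetical) */ }
async function noteFailure(record, err) { console.error(record.SequenceNumber, err); }

async function handleRecord(shardId, record) {
  try {
    await processRecord(record);    // <- TRY: your business logic
  } catch (err) {
    await noteFailure(record, err); // <- CATCH: park it somewhere to investigate later
  } finally {
    await ddb.put({                 // <- FINALLY: always persist the position
      TableName: 'checkpoints',     // hypothetical table
      Item: { shardId, sequenceNumber: record.SequenceNumber }
    }).promise();
  }
}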
This is a common question on processing events in Kinesis and I'll try to give you some points to build your Lambda function to handle such issues with "corrupted" data. Since it is best practice to have separated parts of your system writing to the Kinesis stream and other parts reading from the Kinesis stream, it is common that you will have such problems.
First, why do you have such problematic events?
Using Kinesis to process your events is a good way to break up a complex system that is doing both front-end processing (serving end users) and back-end processing (analyzing events) in the same code, into two independent parts. The front-end people can focus on their business, while the back-end people don't need to push code changes to the front-end if they want to add functionality to serve their analytic use cases. Kinesis is a buffer of events that both removes the need for synchronization and simplifies the business logic code.
Therefore, we would like events written to the stream to be flexible in their "schema", and if the front-end teams wish to change the event format, add fields, delete fields, change the protocol or the encryption keys, they should be able to do that as often as they want.
Now it is up to the teams that are reading from the stream to be able to process such flexible events in an efficient way, and not break their processing every time such a change happens. Therefore, it should be expected that your Lambda function will see events that it can't process, and a "poison pill" is not as rare an event as you might expect.
Second, how do you handle such problematic events?
Your Lambda function will get a batch of events to process. Please note that you shouldn't get the events one by one, but in large batches of events. If your batches are too small, you will quickly get large lags on the stream.
For each batch you will iterate over the events, process them, and then checkpoint the last sequence-id of the batch in DynamoDB. Lambda does most of these steps automatically (see more here: http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-kinesis-events-adminuser-create-test-function.html):
console.log('Loading function');

exports.handler = function(event, context) {
    console.log(JSON.stringify(event, null, 2));
    event.Records.forEach(function(record) {
        // Kinesis data is base64 encoded, so decode it here.
        // Buffer.from replaces the deprecated new Buffer(...) constructor.
        const payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
        console.log('Decoded payload:', payload);
    });
    // Signal success so the batch is "committed" and not retried.
    context.succeed();
};
This is what is happening in the "happy path", if all the events are processed without any problem. But if you encounter any problem in the batch and you don't "commit" the events with the success notification, the batch will fail and you will get all the events in the batch again.
Now you need to decide what the reason for the failure in processing is:
Temporary problem (throttling, network issue...) - it is OK to wait a second and try again for a couple of times. In many cases the issue will resolve itself.
Occasional problem (out of memory...) - it is best to increase the memory allocation of the Lambda function or decrease the batch size. In many cases such modification will resolve the issue.
Constant failure - it means that you have to either ignore the problematic event (put it in a DLQ - dead-letter-queue) or modify your code to handle it.
The problem is to identify the type of failure in your code and handle each type differently. You need to write your Lambda code in a way that identifies it (by the type of exception, for example) and reacts accordingly.
You can use the integration with CloudWatch to write such failures to the console and create the relevant alarms. You can also use CloudWatch Logs as a way to log your "dead-letter-queue" and see what the source of the problem is.
In your lambda you can either throw an error, thus sending back the whole batch for retry, or you can refrain from throwing an error and instead push the problematic record to an SQS queue, to handle those messages differently. SQS has a retention period of up to 14 days. You could also keep checkpoints with each record to know whether the record was processed in a previous run.
If you have a lot of incoming data and you don't want to introduce any latency, you could just ignore the error and move on, while adding those events to an SQS queue.
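A sketch of that second option: catch per-record failures inside the batch and park them in SQS, so the function never throws and the shard keeps moving (the queue URL and the handle helper are placeholders):

// Sketch: shunt poison records to SQS instead of failing the whole batch.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async (event) => {
  for (const record of event.Records) {
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    try {
      await handle(JSON.parse(payload));
    } catch (err) {
      await sqs.sendMessage({
        QueueUrl: process.env.DLQ_URL, // placeholder
        MessageBody: JSON.stringify({ payload, error: String(err) })
      }).promise();
    }
  }
};

async function handle(message) { /* your business logic (hypothetical) */ }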