"LAMBDA_RUNTIME" Error on high-volume Lambda Function - amazon-web-services

I'm currently using a Lambda Function written in JavaScript that is set up with an SQS event source to automatically pull messages from an SQS queue and do some basic processing on the message contents. I cannot show the code, but the summary of the Lambda function's execution is basically:
For each message in the batch it receives as part of the event:
It parses the body, which is a JSON string, into a JavaScript object.
It reads an object from S3 that is referenced in that parsed object, using getObject.
It puts a record into a DynamoDB table using put.
If there were no errors, it deletes the individual SQS message that was processed from the Queue using deleteMessage.
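The shape of the handler is roughly the following (a sketch with placeholder field, table, and queue names, not the actual code), assuming Node.js 12 with the bundled aws-sdk v2:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const dynamodb = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

exports.handler = async (event) => {
    for (const record of event.Records) {
        // 1. Parse the JSON message body
        const body = JSON.parse(record.body);
        // 2. Read the S3 object referenced in the message (field names assumed)
        const object = await s3.getObject({ Bucket: body.bucket, Key: body.key }).promise();
        // 3. Put a record into DynamoDB (table name and attributes assumed)
        await dynamodb.put({ TableName: 'my-table', Item: { id: body.id, size: object.ContentLength } }).promise();
        // 4. Delete the individual message that was processed from the queue
        await sqs.deleteMessage({ QueueUrl: process.env.QUEUE_URL, ReceiptHandle: record.receiptHandle }).promise();
    }
};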
This SQS queue is high-volume and receives messages in bulk, regularly building up a backlog of millions of messages. The Lambda is normally able to scale up to process hundreds of thousands of messages concurrently. This solution has worked well for me with other applications in the past, but I'm now encountering the following intermittent error that reliably begins to appear as the Lambda scales up:
[ERROR] [#############] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
I've been unable to find any information anywhere about what this error means and what causes it. There appears to be no discernible pattern as to which executions encounter it. The function is usually able to run for a brief period without encountering the error and scale to expected levels. But then the error starts to appear quite suddenly and completely destroys the Lambda throughput by forcing it to scale back down.
Does anyone know what this "LAMBDA_RUNTIME" error means and what might cause it? My Lambda Function runtime is Node v12.

Your function is being invoked asynchronously, so when it finishes it signals the caller whether it was successful.
You should have an error some milliseconds earlier, probably an unhandled exception that isn't being logged. If that's the case, your function ends without knowing about the exception and tries to post a success response.
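One way to check for that, as a rough sketch, is to wrap the handler body so any rejected promise is logged explicitly before the invocation fails (processEvent here is a stand-in for your actual logic):
exports.handler = async (event) => {
    try {
        return await processEvent(event); // stand-in for the real processing
    } catch (err) {
        // Surface the real error in CloudWatch before Lambda reports the failure
        console.error('Unhandled error while processing event', err);
        throw err;
    }
};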

I had this error too, only the response code I got was:
[ERROR] [1638918279694] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413.
I went to the Lambda function in the AWS console and ran a test with a custom event I built, and the error I got there was:
{
  "errorMessage": "Response payload size exceeded maximum allowed payload size (6291556 bytes).",
  "errorType": "Function.ResponseSizeTooLarge"
}
So this is the actual error that CloudWatch doesn't return, but the testing section of the Lambda function console does.
I think I'll have to return info to an S3 file or something, but that's another matter.
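A rough sketch of that workaround (with a placeholder bucket and key scheme, and buildLargeResult standing in for whatever produces the big payload) would be to upload the large result and return just a pointer to it:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
    const result = await buildLargeResult(event); // stand-in for the code producing the big payload
    const key = `results/${Date.now()}.json`;     // assumed key scheme
    await s3.putObject({
        Bucket: 'my-results-bucket',              // assumed bucket
        Key: key,
        Body: JSON.stringify(result)
    }).promise();
    // Return a small pointer instead of the payload, staying under the ~6 MB response limit
    return { resultBucket: 'my-results-bucket', resultKey: key };
};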

Related

How does SQS keep track of messages?

I have a pretty standard setup of feeding SQS to Lambda. The lambda reads the message and makes a web request to a defined endpoint.
If I encounter an exception during processing of the SQS message that is due to the form of the message then I put the message on a dead letter queue.
If I encounter an error with the web request, I put the message back on the feeding queue to make the HTTP request at a later time.
This seems to work fine, but we just ran into an issue where an HTTP endpoint was down for 4 days and the feeding queue dropped the message. I imagine this has something to do with the retention period setting of the queue.
Questions
Is there a way to know, in the lambda, how many times a message has been replayed?
How did the feeder queue know that the message that was re-enqueued was the same as the one that was originally put on the queue?
I'm currently not explicitly deleting a message off the queue. Not doing that hasn't seemed to cause any issues, no re-processing of messages or anything. Should I be explicitly deleting them?
The normal process would be:
The AWS Lambda function is triggered, with the message(s) passed via the event parameter
If the Lambda function successfully processes the message(s), it should return a 'success' code (200) and the message is automatically removed from the queue
If the Lambda function is unable to process the message, it should return a 'failure' code (eg 400) and Amazon SQS will automatically attempt to re-process the message (unless it has exceeded the retry count)
If the Lambda function fails (eg due to a timeout), Amazon SQS will automatically attempt to re-process the message (unless it has exceeded the retry count)
If a message has exceeded its retry count, Amazon SQS will move the message to the Dead Letter Queue
To answer your questions:
If you wish to take responsibility for these activities yourself, you can use the ApproximateReceiveCount attribute on the message (a sketch follows these answers). In a ReceiveMessage request, it appears that you should add AttributeNames=['ApproximateReceiveCount'], but the documentation is a bit contradictory; you might need to use All instead.
Since you are sending a new message to the queue, Amazon SQS is not aware that it is the same message. The message is not 're-enqueued' since it is a new message.
When your Lambda function returns 'success' (200), the message is being deleted off the queue for you.
You might consider using the standard functionality for retries and Dead Letter Queues rather than implementing that logic yourself.
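As a sketch of the first point: when messages arrive via the Lambda trigger, the receive count is already present on each record, so you could branch on it like this (the threshold is arbitrary):
exports.handler = async (event) => {
    for (const record of event.Records) {
        const receiveCount = Number(record.attributes.ApproximateReceiveCount);
        if (receiveCount > 3) {
            // e.g. divert to your own dead-letter handling instead of retrying again
            continue;
        }
        // ... normal processing ...
    }
};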

SQS Lambda Trigger with Visibility Timeout extension

I'm working on a solution where I have an SQS queue with a Lambda trigger. My understanding is that Lambda will receive messages in batches to be processed, and once the Lambda function is successful, the messages in the SQS queue are automatically deleted. However, how do I only allow some of those messages to be deleted?
Let's assume this use case:
Lambda function receives a batch with 10 messages; only 7 messages are valid and can be processed, and the other 3 messages need to be reprocessed at a later point.
My initial thought was that I could update the visibility timeout via boto3's change_message_visibility for each of the 3 messages to have them reprocessed after the timeout. However, since the overall Lambda function execution is successful, all 10 messages are deleted from the SQS queue.
Any suggestions?
Yes, by default, the Lambda function deletes all the messages upon success. You would need to handle this in your code, but not by changing the visibility timeout of the messages.
Add a DLQ (dead-letter queue) that will actually handle the failed messages (messages go to the DLQ after a certain number of failed processing attempts, depending on how you set it up)
You have a few options here:
You can handle each item yourself and delete the messages that are processed successfully. For a message that isn't successful, you can throw an error so it won't be deleted automatically by the Lambda function (see the sketch after this list)
If you use JavaScript, you can try Middy
If you use Python, you can use Lambda Powertools Python
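A sketch of the first option: delete each successfully processed message yourself, then throw at the end so only the failed messages return to the queue (processRecord and the queue URL are placeholders):
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async (event) => {
    const failed = [];
    for (const record of event.Records) {
        try {
            await processRecord(record); // stand-in for your per-message logic
            // Delete this message ourselves so a later failure doesn't bring it back
            await sqs.deleteMessage({
                QueueUrl: process.env.QUEUE_URL,
                ReceiptHandle: record.receiptHandle
            }).promise();
        } catch (err) {
            failed.push(record.messageId);
        }
    }
    if (failed.length > 0) {
        // Failing the invocation makes only the still-undeleted messages visible again
        throw new Error(`Failed to process messages: ${failed.join(', ')}`);
    }
};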
For AWS Lambdas with an SQS trigger, by default, when your function encounters an error processing one or more messages in a given batch, the entire batch is marked as a failure. All of the messages in the batch are made visible again in the queue. Depending on your redrive policy, you can end up repeatedly processing successful messages along with the failures.
Rather than change the visibility timeout, the simplest way to specify which messages should be retried later and which can safely be deleted from the queue is to change the function response type to ReportBatchItemFailures. This allows you to return a list of failed message ids, indicating that only those messages in the batch should be made visible again in the queue.
Here's what the reporting syntax looks like for a handler function in Node.js:
exports.handler = async (event) => {
    // Process the event
    const batchItemFailureResponse = {
        batchItemFailures: [
            {
                itemIdentifier: "idFailedMessage1"
            },
            {
                itemIdentifier: "idFailedMessage2"
            },
            {
                itemIdentifier: "idFailedMessage3"
            }
        ]
    };
    return batchItemFailureResponse;
};
There is more information to be found in the official documentation.
This response type is configured when setting a queue as an event source for the Lambda. If you're configuring from the console, navigate to the Lambda function page, select the Configuration tab, and then choose Triggers. Then choose Add Trigger and choose the SQS trigger type. In addition to providing the standard parameters, be sure to check the box under Report batch item failures after expanding Additional Settings. It should look something like this:
[Screenshot: the Add trigger dialog with "Report batch item failures" checked]
This parameter must be set when first creating the trigger.
This response type can also be defined if you use CloudFormation templates to provision your resources. See the AWS documentation for more information. Note that if you use AWS SAM event source mappings, the documentation suggests that adding FunctionResponseTypes to the YAML with ReportBatchItemFailures in the type list isn't supported. That is incorrect; the documentation is simply outdated. There is an open issue about addressing this oversight.
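For reference, in a SAM template the event source mapping would look roughly like this (resource and queue names are placeholders):
Events:
  MyQueueEvent:
    Type: SQS
    Properties:
      Queue: !GetAtt MyQueue.Arn
      BatchSize: 10
      FunctionResponseTypes:
        - ReportBatchItemFailures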
Finally, in addition to reporting batch item failures, you should provision a target DLQ (dead-letter queue) and determine a reasonable maximum receive count so that action can be taken on messages that fail repeatedly.

AWS Lambda triggered twice for a single SQS message

I have a system where a Lambda is triggered with an SQS queue as the event source. Each message gets its own internal unique ID so we can differentiate between two requests.
Now Lambda deletes the message from the queue automatically after a successful invocation and keeps the message in flight while processing it, so duplicate processing of a unique message should ideally never occur.
But when I checked my logs, a message with the same unique ID was processed twice within 100 milliseconds.
So it seems like two Lambda invocations were triggered for one message and something failed on the AWS side, whether the visibility timeout or something else. I have read online that a few others have gone through the same situation.
Can anyone who has gone through the same situation explain how they solved it, or can people running scalable systems that don't have this kind of issue help me understand why I might be having it?
Note: one single message was successfully processed twice; this wasn't a case of retry on failure.
I faced a similar issue, where a Lambda (let's call it lambda-1) is triggered through a queue, and lambda-1 further invokes lambda-2 'synchronously' (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html). The message basically stays in flight, returns to the queue after the visibility timeout expires, and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making the call asynchronous in lambda-1 can resolve this issue. In the case above, invoking lambda-2 with InvocationType='Event' returns immediately, which in turn lets lambda-1 finish successfully so the item gets deleted from the queue.
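A minimal sketch of that change in lambda-1 (the function name is a placeholder):
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

exports.handler = async (event) => {
    // 'Event' queues the invocation and returns immediately, so lambda-1
    // finishes well within its timeout and its SQS message gets deleted
    // instead of reappearing after the visibility timeout.
    await lambda.invoke({
        FunctionName: 'lambda-2',            // placeholder name
        InvocationType: 'Event',
        Payload: JSON.stringify(event)
    }).promise();
};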

SQS Lambda Integration - what happens when an Exception is thrown

The documentation states that:
A Lambda function can fail for any of the following reasons:
The function times out while trying to reach an endpoint.
The function fails to successfully parse input data.
The function experiences resource constraints, such as out-of-memory errors or other timeouts.
For my case, I'm using C# Lambda with SQS integration
If the invocation fails or times out, every message in the batch will be returned to the queue, and each will be available for processing once the Visibility Timeout period expires
My question: what happens if, using an SQS Lambda integration (.NET),
My function throws an Exception
My SQS visibility timer is set to 15 minutes, max receive count is 1, DLQ setup
Will the function retry?
Will it be put into the DLQ when Exceptions are thrown after all retries?
The moment your code throws an unhandled/uncaught exception, Lambda fails. If you have the max receive count set to 1, the message will be sent to the DLQ after the first failure; it will not be retried. If your max receive count is set to 5, for example, then the moment the Lambda function fails, the message will be returned to the queue after the visibility timeout has expired.
The reason for this behaviour is that you are giving Lambda permission to poll the queue on your behalf. If it gets a message, it invokes a function and gives you a single opportunity to process that message. If you fail, the message returns to the queue and Lambda continues polling the queue on your behalf; it does not care whether the next message is the same as the failed message or a brand new message.
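For reference, the retry count and DLQ routing described above come from the queue's redrive policy, which looks roughly like this (the ARN is a placeholder):
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-dlq",
  "maxReceiveCount": 5
}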
Here is a great blog post which helped me understand how these triggers work.

aws lambda function triggering multiple times for a single event

I am using an AWS Lambda function to convert an uploaded wav file in a bucket to mp3 format and later move the file to another bucket. It is working correctly, but there's a problem with triggering. When I upload small wav files, the Lambda function is called once. But when I upload a large wav file, this function is triggered multiple times.
I have googled this issue and found that Lambda is stateless, so it may be called multiple times (not sure whether this trigger is for multiple uploads or the same upload).
https://aws.amazon.com/lambda/faqs/
Is there any method to call this function once for a single upload?
Short version:
Try increasing the timeout setting in your Lambda function configuration.
Long version:
I guess you are running into the Lambda function being timed out here.
S3 events are asynchronous in nature, and a Lambda function listening to S3 events is retried at least 3 times before that event is rejected. You mentioned your Lambda function is executed only once (with no error) for smaller uploads, upon which you do the conversion and re-upload. There is a possibility that the time required for conversion and re-upload in your code is greater than the timeout setting of your Lambda function.
Therefore, you might want to try increasing the timeout setting in your lambda function configuration.
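For example, with the AWS CLI and a placeholder function name, the timeout could be raised with something like:
aws lambda update-function-configuration --function-name my-function --timeout 300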
By the way, one way to confirm that your Lambda function is invoked multiple times is to look in the CloudWatch logs for occurrences of the event id (67fe6073-e19c-11e5-1111-6bqw43hkbea3) -
START RequestId: 67jh48x4-abcd-11e5-1111-6bqw43hkbea3 Version: $LATEST
This event id represents a specific event for which lambda was invoked and should be same for all lambda executions that are responsible for the same S3 event.
Also, you can look for execution time (Duration) in the following log line that marks end of one lambda execution -
REPORT RequestId: 67jh48x4-abcd-11e5-1111-6bqw43hkbea3 Duration: 244.10 ms Billed Duration: 300 ms Memory Size: 128 MB Max Memory Used: 20 MB
If not a solution, it will at least give you some room to debug in right direction. Let me know how it goes.
Any event executing Lambda several times is due to the retry behavior of Lambda, as specified in the AWS documentation.
Your code might raise an exception, time out, or run out of memory. The runtime executing your code might encounter an error and stop. You might run out of concurrency and be throttled.
There could be some error in Lambda which makes the client or service invoking the Lambda function retry.
Use CloudWatch logs to find the error; resolving it could resolve the problem.
I too faced the same problem; in my case it was because of an application error, and resolving it helped me.
Recently AWS Lambda added a new property to change the default retry behavior: set Retry attempts to 0 (the default is 2) under the Asynchronous invocation settings.
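If you prefer the CLI, the same setting can be applied with something like this (the function name is a placeholder):
aws lambda put-function-event-invoke-config --function-name my-function --maximum-retry-attempts 0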
For some in-depth understanding of this issue, you should look into message delivery guarantees. Then you can implement a solution using the idempotent consumer pattern.
The context object contains information on which request ID you are currently handling. This ID won't change even if the same event fires multiple times. You could save this ID for every time an event triggers and then check that the ID hasn't already been processed before processing a message.
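A rough sketch of that check, assuming a DynamoDB table whose partition key is the ID being deduplicated (the table and attribute names are placeholders):
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event, context) => {
    try {
        // Record the ID; the conditional write fails if we've already seen it
        await dynamodb.put({
            TableName: 'processed-events',   // placeholder table
            Item: { id: context.awsRequestId },
            ConditionExpression: 'attribute_not_exists(id)'
        }).promise();
    } catch (err) {
        if (err.code === 'ConditionalCheckFailedException') {
            return; // already processed, skip this duplicate
        }
        throw err;
    }
    // ... process the event exactly once ...
};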
In the Lambda configuration, look for "Asynchronous invocation"; there is an option "Retry attempts", which is the maximum number of times to retry when the function returns an error.
Here you can also configure a dead-letter queue.
Multiple retries can also happen due to a read timeout. I fixed it with '--cli-read-timeout 0'.
e.g. if you are invoking Lambda with the AWS CLI or a Jenkins execute-shell step:
aws lambda invoke --cli-read-timeout 0 --invocation-type RequestResponse --function-name ${functionName} --region ${region} --log-type Tail --payload '{}' out
I was also facing this issue earlier; try setting the retry count to 0 under 'Asynchronous invocation'.