The AWS documentation states that
A Lambda function can fail for any of the following reasons:
The function times out while trying to reach an endpoint.
The function fails to successfully parse input data.
The function experiences resource constraints, such as out-of-memory
errors or other timeouts.
In my case, I'm using a C# Lambda with the SQS integration.
If the invocation fails or times out, every message in the batch will be returned to the queue, and each will be available for processing once the visibility timeout period expires.
My question: what happens if, using the SQS Lambda integration (.NET),
my function throws an exception,
and my SQS visibility timeout is set to 15 minutes, the max receive count is 1, and a DLQ is set up?
Will the function retry?
Will the message be put into the DLQ when exceptions are thrown after all retries?
The moment your code throws an unhandled/uncaught exception, the Lambda invocation fails. If you have the max receive count set to 1, the message will be sent to the DLQ after the first failure; it will not be retried. If your max receive count is set to 5, for example, then each time the Lambda function fails the message is returned to the queue after the visibility timeout expires, until the receive count is exhausted and the message is moved to the DLQ.
The reason for this behaviour is that you are giving Lambda permission to poll the queue on your behalf. When it receives a message, it invokes your function and gives you a single opportunity to process that message. If you fail, the message returns to the queue and Lambda continues polling the queue on your behalf; it does not care whether the next message is the same as the failed message or a brand-new one.
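The question is about .NET, but the fail-then-retry-then-DLQ behaviour can be sketched with a minimal handler; Python is used here for brevity, and the `orderId` field is just an illustrative validation rule:

```python
import json

def handler(event, context):
    """Minimal SQS-triggered handler. Any unhandled exception fails the whole
    invocation; SQS redelivery and DLQ routing (maxReceiveCount) take over."""
    for record in event["Records"]:
        body = json.loads(record["body"])  # raises on malformed JSON
        if "orderId" not in body:          # illustrative validation rule
            # Unhandled exception -> invocation fails -> message returns to
            # the queue, or goes to the DLQ once maxReceiveCount is exceeded.
            raise ValueError("missing orderId")
    return {"status": "ok"}
```

The handler itself never talks to SQS for retries; it only succeeds or raises, and the queue configuration decides what happens next.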
Here is a great blog post which helped me understand how these triggers work.
Related
I have a pretty standard setup of feeding SQS to Lambda. The Lambda function reads the message and makes a web request to a defined endpoint.
If I encounter an exception during processing of the SQS message that is due to the form of the message then I put the message on a dead letter queue.
If I encounter an error with the web request, I put the message back on the feeding queue to make the HTTP request at a later time.
This seems to work fine, but we just ran into an issue where an HTTP endpoint was down for 4 days and the feeding queue dropped the message. I imagine this has something to do with the retention period setting of the queue.
Questions
Is there a way to know, in the lambda, how many times a message has been replayed?
How did the feeder queue know that the message that was re-enqueued was the same as the one that was originally put on the queue?
I'm currently not explicitly deleting messages off the queue. Not doing so hasn't seemed to cause any issues: no re-processing of messages or anything. Should I be explicitly deleting them?
The normal process would be:
The AWS Lambda function is triggered, with the message(s) passed via the event parameter
If the Lambda function successfully processes the message(s), it should return a 'success' code (200) and the message is automatically removed from the queue
If the Lambda function is unable to process the message, it should return a 'failure' code (eg 400) and Amazon SQS will automatically attempt to re-process the message (unless it has exceeded the retry count)
If the Lambda function fails (eg due to a timeout), Amazon SQS will automatically attempt to re-process the message (unless it has exceeded the retry count)
If a message has exceeded its retry count, Amazon SQS will move the message to the Dead Letter Queue
To answer your questions:
If you wish to take responsibility for these activities yourself, you can use the ApproximateReceiveCount attribute on the message. In the request, it appears that you should add AttributeNames=['ApproximateReceiveCount'], but the documentation is a bit contradictory. You might need to use All instead.
Since you are sending a new message to the queue, Amazon SQS is not aware that it is the same message. The message is not 're-enqueued' since it is a new message.
When your Lambda function returns 'success' (200), the message is being deleted off the queue for you.
You might consider using the standard functionality for retries and Dead Letter Queues rather than implementing that logic yourself.
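As a sketch of the first point, the receive count can be read straight off each record in the Lambda event (the attribute arrives as a string); the cutoff of 3 below is an arbitrary illustration, not a recommended value:

```python
def receive_count(record):
    """How many times SQS has delivered this message so far."""
    return int(record["attributes"]["ApproximateReceiveCount"])

def handler(event, context):
    handled, skipped = [], []
    for record in event["Records"]:
        if receive_count(record) > 3:   # arbitrary cutoff for illustration
            skipped.append(record["messageId"])
        else:
            handled.append(record["messageId"])
    return {"handled": handled, "skipped": skipped}
```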
I'm curious to know: is the visibility timeout respected if an exception is thrown by the app after taking a message off the SQS queue?
I'm under the impression it is, so if an exception is thrown in the application (say, validation failed), the message remains on the queue until the timeout elapses before being moved to the DLQ.
Is that correct?
When your application calls ReceiveMessage(), a message is returned and that message is made 'invisible' in the queue. When your application has finished processing the message, it should call DeleteMessage() to remove the message from the queue.
If that is not done within the invisibility timeout period, then the message will 'reappear' on the queue so that it can be re-processed. This means that if your application throws an exception and does not complete processing the message, then the message will be re-processed.
However, if this happens multiple times, then the message will be moved to the Dead Letter Queue. This prevents a situation where the message continually fails to be processed. When configuring the SQS Queue, you can configure the number of retries before sending the message to the Dead Letter Queue.
Amazon SQS requires the timeout because that's how it knows that processing has failed. It has no insight into the actual application failure; it can only assume failure once the timeout period has passed. For a 'quicker' response, you can use a very small timeout period and then have your application send a 'heartbeat' to SQS at frequent intervals (by calling ChangeMessageVisibility), telling it that the message is still being processed and thereby resetting the invisibility timeout period.
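A heartbeat like that could look roughly like the following, using boto3's `change_message_visibility`; the interval and extension values are arbitrary, and the `sqs` client is passed in as a parameter so it can be stubbed:

```python
import threading

def start_heartbeat(sqs, queue_url, receipt_handle, interval=20, extension=60):
    """Keep extending a message's invisibility while it is being processed.
    Returns an Event; call .set() on it once processing finishes."""
    stop = threading.Event()

    def beat():
        # Every `interval` seconds, push the invisibility window out again.
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extension,
            )

    threading.Thread(target=beat, daemon=True).start()
    return stop
```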
We have a Lambda event source that polls SQS for messages and passes them on to a Lambda function.
The SQS has a Visibility Timeout that dictates how long after a consumer (lambda) picks up a message it will re-appear in the queue if not successfully completed.
Is there a way from the Lambda function to force this to happen as soon as we detect an error?
For example, say Lambda has a timeout of 10 mins. The SQS Visibility Timeout will need to be longer than that.
But if the Lambda function detects a problem early on, and throws an error - this will fail the message, but we have to wait for the SQS Visibility Timeout before the message is available for other consumers to try again.
Is there a way for the Lambda function to tell SQS that the message has failed and it should go back on the queue immediately ?
If you're using the built-in AWS Lambda SQS integration, then simply throwing an exception (returning a non-success status from the Lambda function) will result in all the SQS messages in that invocation's event being returned to the queue; they become visible to other consumers again once the visibility timeout expires.
From the documentation:
When Lambda reads a batch, the messages stay in the queue but become
hidden for the length of the queue's visibility timeout. If your
function successfully processes the batch, Lambda deletes the messages
from the queue. If your function is throttled, returns an error, or
doesn't respond, the message becomes visible again. All messages in a
failed batch return to the queue, so your function code must be able
to process the same message multiple times without side effects.
Call the ChangeMessageVisibility API and set the visibility timeout of each failed message to 0 seconds. Then throw an exception.
If you received a batch of SQS messages and were able to successfully process some of them, then explicitly delete those (successful) messages from the underlying SQS queue, then set the visibility timeout of each remaining failed message to 0 seconds, then throw an exception.
Setting a message's visibility timeout to 0 terminates the visibility timeout for that message and makes the message immediately visible to other consumers.
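Put together, the pattern could look roughly like this (a Python/boto3 sketch; the `fail_fast` name and the record shapes are illustrative, and the client is passed in so it can be stubbed):

```python
def fail_fast(sqs, queue_url, succeeded, failed):
    """Delete messages that were processed, make the failed ones visible
    again immediately, then raise so the invocation is recorded as failed."""
    for record in succeeded:
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=record["receiptHandle"])
    for record in failed:
        # VisibilityTimeout=0 ends the invisibility window right away.
        sqs.change_message_visibility(QueueUrl=queue_url,
                                      ReceiptHandle=record["receiptHandle"],
                                      VisibilityTimeout=0)
    if failed:
        raise RuntimeError(f"{len(failed)} message(s) failed")
```

Deleting the successes first matters: once the exception propagates, the whole batch counts as failed, and anything not already deleted will be redelivered.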
I have an AWS SQS Queue (standard, non-FIFO) that has a Lambda function as a consumer.
Whenever I send a bunch of messages (usually around 10 at a time) to the queue, only about 2 get picked up by lambda (verified in CloudWatch Logs). The others disappear from the queue.
The Lambda batch size is set to 1, so I would expect all 10 messages to sit in the queue and get picked up by Lambda one by one, but that's not happening. I'm using CloudWatch to check what Lambda is doing, and there is no trace of the missing messages.
I verified in Lambda that it only gets one message every time, by logging the size of the event.Records array (which is always 1).
The queue also has a Dead Letter Queue. Initially the Maximum Receives was set to 1. When I increased that to 3, more messages were getting picked up after the queue's visibility timeout, but still only a few.
My Queue settings
Visibility timeout: 2 minutes
Delivery delay: 0 seconds
Receive message wait time: 5 seconds
Message retention period: 4 days
Maximum message size: 256 KB
I'm wondering why the messages aren't being processed, but instead disappear?
The typical reason why messages are lost is that the Lambda function triggered by Amazon SQS is not correctly processing all of the Records passed to the function.
Make sure the code loops through all Records passed in the event parameter, since multiple messages can be provided in each Lambda invocation.
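A handler that does this correctly is just a loop over `event["Records"]`; returning the count here is only for illustration:

```python
def handler(event, context):
    processed = []
    # A single invocation can carry up to the configured batch size of
    # messages; touching only Records[0] silently drops the rest.
    for record in event["Records"]:
        processed.append(record["messageId"])
    return {"batchSize": len(event["Records"]), "processed": processed}
```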
Turns out this was related to the Reserved Concurrency of the Lambda function. My concurrency was set to 1, which caused issues.
My expectation
SQS messages will remain in the queue until there's a Lambda function available to pick them up.
In reality
Messages that are not picked up because the Lambda function is throttled are treated as failed deliveries once the visibility timeout expires, so their receive count increments even though the function never processed them.
There's an excellent blog post about this issue: https://data.solita.fi/lessons-learned-from-combining-sqs-and-lambda-in-a-data-project/
I have a system where a Lambda is triggered with an SQS queue as the event source. Each message gets its own internal unique id to differentiate between two requests.
Lambda deletes the message from the queue automatically after a successful invocation and keeps the message in flight while processing it, so duplicate processing of a unique message should ideally never occur.
But when I checked my logs, two messages with the same unique id were processed within 100 milliseconds of each other.
So it seems like two Lambda invocations were triggered for one message and something failed on the AWS side: either the visibility timeout or something else. I have read online that a few others have gone through the same situation.
Can anyone who has gone through the same situation explain how they solved it? Or can people with scalable systems that don't have this kind of issue help me out with the reasons why I could be having it?
Note: one single message was successfully executed twice; this wasn't a case of retry on failure.
I faced a similar issue, where a Lambda (let's call it lambda-1) is triggered through a queue, and lambda-1 then invokes lambda-2 synchronously (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html). The message goes in flight, returns to the queue after the visibility timeout expires, and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making the call asynchronous in lambda-1 resolved this issue: invoking lambda-2 with InvocationType='Event' returns immediately, so lambda-1 finishes within the visibility timeout and the message is deleted from the queue.
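As a sketch, the asynchronous call with boto3 looks like the following; the client is passed in so it can be stubbed, and the function name is made up:

```python
import json

def invoke_async(lambda_client, function_name, payload):
    """Fire-and-forget: InvocationType='Event' makes Lambda queue the request
    and return immediately (HTTP 202), so the caller is not held open while
    the downstream function runs."""
    resp = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous invocation
        Payload=json.dumps(payload).encode(),
    )
    return resp["StatusCode"]  # 202 means the event was accepted
```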