changeVisibilityTimeout() call to SQS from Lambda fails without throwing exception - amazon-web-services

I have a lambda set up with event bridge to receive messages from an SQS with default vis timeout set to 5 minutes.
Inside the lambda, certain messages are supposed to be updated so that the vis timeout is increased to 10 minutes, and then an error is thrown on purpose to resend message (with the 10 min timeout) back to the SQS queue. The code logic in the lambda is as follows:
Check that message meets conditions
Set Visibility timeout to 10 minutes using synchronous (default) SQS client. Ignore the response object (but it should throw an error on timeout or failure, any response should imply success).
Write "New invisibility timeout" to log (this always appears, meaning the call must be completed)
Throw an exception to send the message back to the queue
This does work 99% of the time but I am seeing edge cases (often during heavy traffic) where the message comes back to the lambda after the default visibility timeout of 5 minutes (and they are not timing out because logs show complete processing). Additional info:
Lambda timeout: 5 minutes
Lambda retries: 2
SQS max receives: 5
Using Java 11
I will try setting the default visibility timeout to 10 minutes to circumnavigate the issue but hope it is a temp fix until I can understand root cause. How can the synchronous call to SQS not work but also not throw any exception?

Related

AWS SQS and Lambda Function : How to fail a lambda function and force message back on to queue

We have a Lambda event source that polls SQS for messages and passes them on to a Lambda function.
The SQS has a Visibility Timeout that dictates how long after a consumer (lambda) picks up a message it will re-appear in the queue if not successfully completed.
Is there a way from the Lambda function to force this to happen as soon as we detect an error?
For example, say Lambda has a timeout of 10 mins. The SQS Visibility Timeout will need to be longer than that.
But if the Lambda function detects a problem early on, and throws an error - this will fail the message, but we have to wait for the SQS Visibility Timeout before the message is available for other consumers to try again.
Is there a way for the Lambda function to tell SQS that the message has failed and it should go back on the queue immediately ?
If you're using the built-in AWS Lambda SQS integration, then simply throwing an exception (returning a non-success status from the Lambda function) will result in all the SQS messages in the event of that invocation being placed back in the queue immediately.
From the documentation:
When Lambda reads a batch, the messages stay in the queue but become
hidden for the length of the queue's visibility timeout. If your
function successfully processes the batch, Lambda deletes the messages
from the queue. If your function is throttled, returns an error, or
doesn't respond, the message becomes visible again. All messages in a
failed batch return to the queue, so your function code must be able
to process the same message multiple times without side effects.
Call the ChangeMessageVisibility API and set the visibility timeout of each failed message to 0 seconds. Then throw an exception.
If you received a batch of SQS messages and were able to successfully process some of them, then explicitly delete those (successful) messages from the underlying SQS queue, then set the visibility timeout of each remaining failed message to 0 seconds, then throw an exception.
Setting a message's visibility timeout to 0 terminates the visibility timeout for that message and makes the message immediately visible to other consumers.

When is an SQS message sent to deadletter queue via redrive policy?

I have an SQS queue linked to a deadletter queue via a redrive policy. Terraform sample:
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.deadletter_queue.arn
maxReceiveCount = 10
})
The queue processing is implemented with exponential backoff. If processing is successful the message is simply deleted. If processing fails it retries with an initial VisibilityTimeout of 30 seconds, which doubles each time, the final retry timeout (between 9th and 10th attempt) is roughly 2 hours.
My question is, after the final retry, when does the message get sent to the deadletter queue? Immediately after the 10th attempt or does it have to wait out the VisibilityTimeout (of about 4 hours)?.
My concern is that it seems redundant to put a visibility timeout on a message if it's just going to be sent straight to the deadletter - this delays alerts for example.
If you have 4 hour visibility timeout, after 10 th attempt SQS will have to wait 4 hours before it can deem the attempt unsuccessfully. SQS can't assume that the attempt failed after 1 minute, if it may take 4 hours to process the message. SQS does not know what are you doing with the messages and has no means of checking if your application failed to process the message before visibility timeout expires.
Lets say you are running 2 consumers, A and B.
Visibility timeout is a time period which start right after "consumer A" receives the message. For this duration, "consumer B" is not able to receive the message (message is reserved for the "consumer A"). So visibility timeout's purpose is to "hide" message from other consumers to avoid duplication, while one of the consumers already working on it (also see aws doc here).
About the question, after the final retry, message is immediately moved to dead letter queue. It doesn't wait for a visibility timeout to be over, because it won't have the chance to be consumed again. Message is moved to dead letter queue and removed from source queue.

Not all SQS messages end up in Lambda: most just disappear

I have an AWS SQS Queue (standard, non-FIFO) that has a Lambda function as a consumer.
Whenever I send a bunch of messages (usually around 10 at a time) to the queue, only about 2 get picked up by lambda (verified in CloudWatch Logs). The others disappear from the queue.
The Lambda batch size is set to 1, so I would expect all 10 messages to sit in the queue and get picked up by Lambda one by one, but that's not happening. I'm using CloudWatch to check what Lambda is doing, and there is no trace of the missing messages.
I verified in Lambda that it only gets one message every time, by logging the size of the event.Records array (which is always 1).
The Queue also has a Dead Letter Queue. Initially the Maximum Receives was set to 1. When I increased that to 3, more messages were getting picked up after the queues Visibility timeout, but still only a few.
My Queue settings
Visibility timeout: 2 minutes
Delivery delay: 0 seconds
Receive message wait time: 5 seconds
Message retention period: 4 days
Maximum message size: 256kb
I'm wondering why the messages aren't being processed, but instead disappear?
The typical reason why messages are lost is that the Lambda function triggered by Amazon SQS is not correctly processing all Records passed to the function.
Make sure the code loops through all Records passed in the even parameter, since multiple messages can be provided in each Lambda invocation.
Turns out this was related to the Reserved Concurrency of the Lambda function. My concurrency was set to 1, which caused issues.
My expectation
SQS messages will remain in the queue until there's a Lambda function available to pick them up.
In reality
Messages that are not picked up by Lambda because the function is throttled, and after the Visibility Timeout are treated as failed message.
There's an excellent blog post about this issue: https://data.solita.fi/lessons-learned-from-combining-sqs-and-lambda-in-a-data-project/

Optimizing SQS Lambda configuration for single concurrency microservice composition

Apologies for the title. It's hard to summarise what I'm trying to accomplish.
Basically let me define a service to be an SQS Queue + A Lambda function.
A service (represented by square brackets below) performs a given task, where the queue is the input interface, processes the input, and outputs on to the queue of the subsequent service.
Service 1 Service 2 Service 3
[(APIG) -> (Lambda)] -> [(SQS) -> (Lambda)] -> [(SQS) -> (Lambda)] -> ...
Service 1: Consumes the request and payload, splits it into messages and passes on to the queue of the next service.
Service 2: This service does not have a reserved concurrency. It validates each message on the queue, and if valid, passes on to the next service.
Service 3: Processes each message in the queue (ideally in batches of approximately 100). The lambda here must have a reserved concurrency of 1 (as it hits an API that can't process multiple requests concurrently).
Currently I have the following configuration on Service 3.
Default visibility timeout of queue = 5 minutes
Lambda timeout = 5 minutes
Lambda reserved concurrency = 1
Problem 1: Service 3 consumes x items off the queue and if it finishes processing them within 30 seconds I expect the queue to process the next x items off the queue immediately (ideally x=100). Instead, it seems to always wait 5 minutes before taking the next batch of messages off the queue, even if the lambda completes in 30 seconds.
Problem 2: Service 3 typically consumes a few messages at a time (inconsistent) rather than batches of 100.
A couple of more notes:
In service 3 I do not explicitly delete messages off the queue using the lambda. AWS seems to do this itself when the lambda successfully finishes processing the messages
In service 2 I have one item per message. And so when I send messages to Service 3 I can only send 10 items at a time, which is kind of annoying. Because queue.send_messages(Entries=x), len(x) cannot exceed 10.
Does anyone know how I solve Problem 1 and 2? Is it an issue with my configuration? If you require any further information please ask in comments.
Thanks
Both your problems and notes indicate misconfigured SQS and/or Lambda function.
In service 3 I do not explicitly delete messages off the queue using
the lambda. AWS seems to do this itself when the lambda successfully
finishes processing the messages.
This is definitely not the case here as it would go agains the reliability of SQS. How would SQS know that the message was successfully processed by your Lambda function? SQS doesn't care about consumers and doesn't really communicate with them and that is exactly the reason why there is a thing such as visibility timeout. SQS deletes message in two cases, either it receives DeleteMessage API call specifying which message to be deleted via ReceiptHandle or you have set up redrive policy with maximum receive count set to 1. In such case, SQS will automatically send message to dead letter queue when if it receives it more than 1 time which means that every message that was returned to the queue will be send there instead of staying in the queue. Last thing that can cause this is a low value of Message Retention Period (min 60 seconds) which will drop the message after x seconds.
Problem 1: Service 3 consumes x items off the queue and if it finishes
processing them within 30 seconds I expect the queue to process the
next x items off the queue immediately (ideally x=100). Instead, it
seems to always wait 5 minutes before taking the next batch of
messages off the queue, even if the lambda completes in 30 seconds.
This simply doesn't happen if everything is working as it should. If the lambda function finishes in 30 seconds, if there is reserved concurrency for the function and if there are messages in the queue then it will start processing the message right away.
The only thing that could cause is that your lambda (together with concurrency limit) is timing out which would explain those 5 minutes. Make sure that it really finishes in 30 seconds, you can monitor this via CloudWatch. The fact that the message has been successfully processed doesn't necessarily mean that the function has returned. Also make sure that there are messages to be processed when the function ends.
Problem 2: Service 3 typically consumes a few messages at a time
(inconsistent) rather than batches of 100.
It can never consume 100 messages since the limit is 10 (messages in the sense of SQS message not the actual data that is stored within the message which can be anywhere up to 256 KB, possibly "more" using extended SQS library or similar custom solution). Moreover, there is no guarantee that the Lambda will receive 10 messages in each batch. It depends on the Receive Message Wait Time setting. If you are using short polling (1 second) then only subset of servers which are storing the messages will be polled and a single message is stored only on a subset of those servers. If those two subsets do not match when the message is polled, the message is not received in that batch. You can control this by increasing polling interval, Receive Message Wait Time, (max 20 seconds) but even if there are not enough messages in the queue when the timer finishes, the batch will still be received with fewer messages, possibly zero.
And as it was mentioned in the comments, using this strategy with concurrency set to low number can lead to some problems. Another thing is that you need to ensure that rate at which messages are produced is somehow consistent with the time it takes for one instance of lambda function to process the message otherwise you will end up with constantly growing queue, possibly losing messages after they outlive the Message Retention Period.

SQS Lambda Integration - what happens when an Exception is thrown

The document states that
A Lambda function can fail for any of the following reasons:
The function times out while trying to reach an endpoint.
The function fails to successfully parse input data.
The function experiences resource constraints, such as out-of-memory
errors or other timeouts.
For my case, I'm using C# Lambda with SQS integration
If the invocation fails or times out, every message in the batch will be returned to the queue, and each will be available for processing once the Visibility Timeout period expires
My question: What happen if I, using an SQS Lambda integration ( .NET )
My function throws an Exception
My SQS visibility timer is set to 15 minutes, max receive count is 1, DLQ setup
Will the function retry?
Will it be put into the DLQ when Exceptions are thrown after all retries?
The moment your code throws an unhandled/uncaught exception Lambda fails. If you have max receive count set to 1 the message will be sent to the DLQ after the first failure, it will not be retried. If your max receive count is set to 5 for example, the moment the Lambda function fails, the message will be returned to the queue after the visibility timeout has expired.
The reason for this behaviour is you are giving Lambda permissions to poll the queue on your behalf. If it gets a message it invokes a function and gives you a single opportunity to process that message. If you fail the message returns to the queue and Lambda continues polling the queue on your behalf, it does not care if the next message is the same as the failed message or a brand new message.
Here is a great blog post which helped me understand how these triggers work.