Azure WebJobs failed message retry policy - azure-webjobs

I'm trying to find out if there is any way to modify the WebJobs SDK retry policy.
Right now if a WebJob function throws an exception, the message is re-queued straight away. This isn't ideal, especially if the error was due to something like a DB timeout.
Does anyone know if the policy is modifiable to something like an exponential backoff? Or is there some other workaround?

As far as I know, the SDK does not support a configurable retry policy (not to be confused with the queue client's retry policy). If I'm understanding your intentions, you would want to catch the exception, call DeleteMessage to pull the message off the queue, re-enqueue an identical message with a longer initial visibility delay, and then rethrow the exception. You would need to track the number of dequeues within the message itself, since deleting the message resets its DequeueCount. Alternatively, if you want to back off and retry the operation that caused the exception while holding onto the queue message, you can call UpdateMessage to extend the visibility timeout and then throw the exception.
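The delete-and-re-enqueue approach above is language-neutral, so here is a minimal sketch of the logic in Python. The queue object, its `add_message(body, visibility_delay)` method, and the `attempt` field are all illustrative assumptions standing in for the storage queue client's delete/add calls:

```python
import json

def next_visibility_delay(attempt, base=4, cap=600):
    """Exponential backoff: 4s, 16s, 64s, ... capped at 10 minutes."""
    return min(base ** attempt, cap)

def requeue_with_backoff(queue, raw_body):
    """Re-enqueue a failed message with a longer initial visibility delay.
    `queue` is a hypothetical stand-in exposing add_message(body, visibility_delay),
    i.e. the equivalent of AddMessage with an initial visibility delay.
    Returns the new attempt count."""
    msg = json.loads(raw_body)
    # Track our own attempt counter, since deleting the message resets DequeueCount.
    msg["attempt"] = msg.get("attempt", 0) + 1
    queue.add_message(json.dumps(msg), visibility_delay=next_visibility_delay(msg["attempt"]))
    return msg["attempt"]
```

The caller would invoke this from the catch block, after deleting the original message and before rethrowing, and would give up (or dead-letter) once `attempt` exceeds some limit.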

Related

SQS SendMessageBatch Partial Failure

I want to send messages to SQS in batches. I had a look at the API and it seems SendMessageBatch will work here.
However, one thing I am not able to fully understand is that it returns a list of BatchResultErrorEntry in case of partial failure.
The AWS docs don't mention whether these are meant to be retried or are just for the sake of information.
If they are meant to be retried, is there a risk of getting into an infinite loop if the same items fail again? In that case, should we identify which of the failed items can be safely retried based on the Code? Also, since the failed message itself is not returned, it looks like we would have to maintain a mapping from the batch entry Id to the message in order to resend it.
Or is the best practice here to log these and throw an exception to the caller that these items failed?
Looking for suggestions.
Those are meant to be retried using an exponential backoff strategy.
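A sketch of that retry loop, assuming a boto3-style `send_message_batch` client. The function name, the Id-to-body mapping, and the `base_delay` parameter are illustrative; the `SenderFault` flag on BatchResultErrorEntry is what distinguishes entries worth retrying from ones that will fail forever:

```python
import time

def send_batch_with_retry(client, queue_url, messages, max_attempts=5, base_delay=1.0):
    """Retry only the failed entries of SendMessageBatch with exponential backoff.
    `messages` maps each batch entry Id to its body -- the caller keeps this
    mapping because failure entries echo only the Id, not the body."""
    pending = dict(messages)
    for attempt in range(max_attempts):
        entries = [{"Id": i, "MessageBody": body} for i, body in pending.items()]
        resp = client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed = resp.get("Failed", [])
        # Sender-side failures (bad message) will never succeed on retry,
        # so surface them instead of looping forever.
        fatal = [f for f in failed if f.get("SenderFault")]
        if fatal:
            raise RuntimeError(f"unretryable batch entries: {fatal}")
        retryable = {f["Id"] for f in failed}
        if not retryable:
            return
        pending = {i: pending[i] for i in retryable}
        time.sleep(min(base_delay * 2 ** attempt, 30))  # exponential backoff
    raise RuntimeError(f"gave up on entries: {sorted(pending)}")
```

Bounding the attempts and checking SenderFault is what avoids the infinite-loop concern from the question.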

Is it necessary for a Lambda to delete messages from an SQS queue after processing?

I'm looking at the AWS SQS documentation here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/ReceiveMessage.html#receive-sqs-message
My understanding is that we need to delete the message using AmazonSQSClient.DeleteMessage() once we're done processing it, but is this necessary when we're working with an SQS triggered Lambda?
I'm testing with a Lambda function that's triggered by an SQSEvent, and unless I'm mistaken, it appears that if the Lambda function runs to completion without throwing any errors, the message does NOT return to the SQS queue. If this is true, then I would rather avoid making an unnecessary call to AmazonSQSClient.DeleteMessage().
Here is a similar question from 2019 with the top answer saying that the SDK does not delete messages automatically and that they need to be explicitly deleted within the code. I'm wondering if anything has changed since then.
Thoughts?
The key here is that you are using the AWS Lambda integration with SQS. In that instance AWS Lambda handles retrieving the messages from the queue (making them available via the event object), and automatically deletes the message from the queue for you if the Lambda function returns a success status. It will not delete the message from the queue if the Lambda function throws an error.
When using AWS Lambda integration with SQS you should not be using the AWS SDK to interact with the SQS queue at all.
Update:
Lambda now supports partial batch failure for SQS whereby the Lambda function can return a list of failed messages and only those will become visible again.
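To illustrate the partial batch failure feature, here is a minimal Python handler sketch. It assumes ReportBatchItemFailures is enabled on the event source mapping; `process` is a hypothetical per-message function:

```python
def process(body):
    """Hypothetical per-message work; raise to mark the message as failed."""
    if body == "poison":
        raise ValueError("cannot handle this message")

def handler(event, context):
    """SQS-triggered Lambda using a partial batch response: only the
    messages listed in batchItemFailures become visible again, the rest
    are deleted by Lambda as usual."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list means the whole batch succeeded.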
Amazon SQS doesn't automatically delete a message after retrieving it for you, in case you don't successfully receive the message (for example, if the consumers fail or you lose connectivity). To delete a message, you must send a separate request which acknowledges that you've successfully received and processed the message.
This has not changed and likely won't change in the future, as there is no way for SQS to definitively know in all cases whether messages have been successfully processed. If SQS started to "assume" what happens downstream, it risks becoming unreliable in many scenarios.
Yes, otherwise the next time you ask for a set of messages, you will get the same messages back - maybe not on the next call, but eventually you will. You likely don't want to keep processing the same set of messages over and over.
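When you are polling SQS yourself (outside the Lambda integration), the explicit delete is the acknowledgement. A minimal sketch, assuming a boto3-style SQS client and a caller-supplied `handle` function:

```python
def drain(client, queue_url, handle):
    """Receive a batch, process each message, then explicitly delete it.
    `client` is assumed to expose boto3-style receive_message/delete_message."""
    resp = client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        handle(msg["Body"])
        # Without this call the message becomes visible again after the
        # visibility timeout and will eventually be redelivered.
        client.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting only after `handle` succeeds is what gives you at-least-once processing rather than message loss.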

AWS Lambda triggered twice for a single SQS message

I have a system where a Lambda is triggered with an SQS queue as the event source. Each message gets our own internal unique id to differentiate between two requests.
Now Lambda deletes the message from the queue automatically after a successful invocation, and keeps the message in flight while processing it, so duplicate processing of a unique message should ideally never occur.
But when I checked my logs, two messages with the same unique id were processed within 100 milliseconds of each other.
So it seems two Lambdas were triggered for one message, and something failed on the AWS side: either the visibility timeout or something else. I have read online that a few others have gone through the same situation.
Can anyone who has gone through the same situation explain how they solved it? Or can people with scalable systems that don't have this kind of issue help me out with the reasons why I could be having it?
Note: one single message was successfully executed twice; this wasn't a case of retry on failure.
I faced a similar issue, where a Lambda (let's call it lambda-1) is triggered through a queue, and lambda-1 invokes lambda-2 synchronously (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html). The message goes in flight, returns after the visibility timeout expires, and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making async calls in lambda-1 can resolve this issue. In the case above, invoking lambda-2 with InvocationType='Event' returns immediately, so lambda-1 completes and the item is deleted from the queue.
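For reference, a fire-and-forget invocation sketch, assuming a boto3-style Lambda client (the wrapper function name is illustrative):

```python
import json

def invoke_async(lambda_client, function_name, payload):
    """Invoke lambda-2 asynchronously so lambda-1 returns quickly and its
    source queue message gets deleted. With InvocationType='Event' the
    call returns 202 Accepted immediately instead of waiting for the
    function to finish."""
    resp = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # async: don't hold the connection open
        Payload=json.dumps(payload),
    )
    return resp["StatusCode"]
```

With the default InvocationType ("RequestResponse") the same call would block for lambda-2's entire run time, which is what keeps lambda-1 alive past the queue's visibility timeout.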

How to handle Dead Letter Queues in Amazon SQS?

I am using event-driven architecture for one of my projects. Amazon Simple Queue Service supports handling failures.
If a message was not successfully handled, it does not get to the part where I delete the message from the queue. If it's a one-time failure, it is handled gracefully. However, if it is an erroneous message, it makes its way into the DLQ.
My question is what should be happening with DLQs later on? There are thousands of those messages stuck in the DLQ. How are they supposed to be handled?
I would love to hear some real-life examples and engineering processes that are in place in some of the organizations.
"It depends!"
Messages would have been sent to the Dead Letter Queue because something didn't happen as expected. It might be due to a data problem, a timeout or a coding error.
You should:
Start examining the messages that went to the Dead Letter Queue
Try to re-process the messages to determine the underlying cause of the failure (though sometimes it is a random failure that you cannot reproduce)
Once a cause is found, update the system to handle that particular use-case, then move on to the next cause
Common causes can be database locks, network errors, programming errors and corrupt data.
It's probably a good idea to set up some sort of monitoring so that somebody investigates quickly, rather than letting it accumulate to thousands of messages.
The messages moved to the DLQ are, as you said, considered erroneous.
If the messages are erroneous due to a bug in the code, you should redrive these DLQ messages to the source queue once you have fixed the bug, so that they get another chance to be reprocessed.
It is very unlikely that "temporarily" erroneous messages are moved to the DLQ if you have already configured the maxReceiveCount as 3 or more for your source queue. Temporary problems are mostly absorbed by this retry configuration.
And in the end, the DLQ is also an ordinary SQS queue, which retains messages for up to 14 days. Even if there are thousands of messages there, they will eventually be gone. At this point, there are two options:
The messages in the DLQ are "really" erroneous. So look at the metrics, messages and logs to identify the root cause. If there is no bug to fix, it means you are keeping unneeded data in the DLQ, so there is nothing wrong with losing it in 14 days. If there is a bug, fix it and simply redrive the messages from the DLQ to the source queue.
You don't want to dig through the messages to identify the reason for the failure, and you only want to persist the message data for historical reasons (god knows why). You can create a Lambda function to poll the messages and persist them in a desired target database.
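A do-it-yourself redrive can be sketched as a receive/send/delete loop, assuming a boto3-style SQS client. SQS also offers a built-in redrive feature, so treat this purely as an illustration of the mechanics:

```python
def redrive(client, dlq_url, source_url, limit=1000):
    """Move up to `limit` messages from the DLQ back to the source queue
    after a bug fix. Deletes from the DLQ only after the message has been
    re-sent, so a crash mid-loop cannot lose messages (it may duplicate)."""
    moved = 0
    while moved < limit:
        resp = client.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10)
        msgs = resp.get("Messages", [])
        if not msgs:
            break  # DLQ is drained
        for m in msgs:
            client.send_message(QueueUrl=source_url, MessageBody=m["Body"])
            client.delete_message(QueueUrl=dlq_url, ReceiptHandle=m["ReceiptHandle"])
            moved += 1
    return moved
```

Note the send-before-delete ordering gives at-least-once semantics, so the consumer on the source queue should tolerate the occasional duplicate.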

SQS Lambda Integration - what happens when an Exception is thrown

The documentation states that
A Lambda function can fail for any of the following reasons:
The function times out while trying to reach an endpoint.
The function fails to successfully parse input data.
The function experiences resource constraints, such as out-of-memory errors or other timeouts.
For my case, I'm using a C# Lambda with the SQS integration.
If the invocation fails or times out, every message in the batch will be returned to the queue, and each will be available for processing once the Visibility Timeout period expires
My question: what happens if, using the SQS Lambda integration (.NET):
My function throws an exception
My SQS visibility timeout is set to 15 minutes, the max receive count is 1, and a DLQ is set up
Will the function retry?
Will the message be put into the DLQ when exceptions are thrown after all retries?
The moment your code throws an unhandled/uncaught exception, Lambda fails. If you have the max receive count set to 1, the message will be sent to the DLQ after the first failure; it will not be retried. If your max receive count is set to 5, for example, then the moment the Lambda function fails, the message will be returned to the queue after the visibility timeout has expired.
The reason for this behaviour is that you are giving Lambda permission to poll the queue on your behalf. If it gets a message, it invokes a function and gives you a single opportunity to process that message. If you fail, the message returns to the queue and Lambda continues polling the queue on your behalf; it does not care whether the next message is the same as the failed message or a brand new one.
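The receive-count behaviour described above can be sketched as a toy simulation. This is not an AWS API, just an illustration of the redrive policy: each failed receive returns the message to the queue until the receive count exceeds maxReceiveCount, at which point SQS moves it to the DLQ:

```python
def deliver(process, message, max_receive_count):
    """Toy model of SQS redrive: attempt delivery up to max_receive_count
    times; a failure makes the message visible again after the visibility
    timeout, and exhausting the count sends it to the DLQ."""
    for receive_count in range(1, max_receive_count + 1):
        try:
            process(message)
            return ("processed", receive_count)
        except Exception:
            continue  # message reappears after the visibility timeout
    return ("dlq", max_receive_count)

def always_fail(msg):
    """Stand-in for a handler that throws an unhandled exception."""
    raise RuntimeError("boom")
```

With `max_receive_count=1`, `always_fail` sends the message straight to the DLQ with no retry, matching the questioner's configuration.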
Here is a great blog post which helped me understand how these triggers work.