Is it necessary for a Lambda to delete messages from an SQS queue after processing? - amazon-web-services

I'm looking at the AWS SQS documentation here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/ReceiveMessage.html#receive-sqs-message
My understanding is that we need to delete the message using AmazonSQSClient.DeleteMessage() once we're done processing it, but is this necessary when we're working with an SQS triggered Lambda?
I'm testing with a Lambda function that's triggered by an SQSEvent, and unless I'm mistaken, it appears that if the Lambda function runs to completion without throwing any errors, the message does NOT return to the SQS queue. If this is true, then I would rather avoid making that unnecessary call to AmazonSQSClient.DeleteMessage().
Here is a similar question from 2019 with the top answer saying that the SDK does not delete messages automatically and that they need to be explicitly deleted within the code. I'm wondering if anything has changed since then.
Thoughts?

The key here is that you are using the AWS Lambda integration with SQS. In that instance AWS Lambda handles retrieving the messages from the queue (making them available via the event object), and automatically deletes the message from the queue for you if the Lambda function returns a success status. It will not delete the message from the queue if the Lambda function throws an error.
When using AWS Lambda integration with SQS you should not be using the AWS SDK to interact with the SQS queue at all.
Update:
Lambda now supports partial batch failure for SQS whereby the Lambda function can return a list of failed messages and only those will become visible again.
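As a rough illustration of the behaviour described above, here is a minimal sketch of an SQS-triggered handler that never calls DeleteMessage. The processMessage helper and the orderId field are hypothetical, made up for this example; the only thing the sketch relies on is that returning normally lets Lambda delete the batch, while a thrown error leaves it on the queue.

```javascript
// Hypothetical per-message work; replace with real processing.
async function processMessage(body) {
  const payload = JSON.parse(body); // throws on malformed JSON
  if (!payload.orderId) {
    throw new Error("missing orderId"); // orderId is an assumed field
  }
  return payload.orderId;
}

// SQS-triggered handler: there is no DeleteMessage call anywhere.
// Returning normally lets Lambda delete the whole batch;
// a thrown error leaves the whole batch on the queue for a retry.
const handler = async (event) => {
  for (const record of event.Records) {
    await processMessage(record.body);
  }
  return { processed: event.Records.length };
};

exports.handler = handler;
```

Note that without partial batch failure reporting, one bad message causes every message in the batch to be retried.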

Amazon SQS doesn't automatically delete a message after retrieving it for you, in case you don't successfully receive the message (for example, if the consumers fail or you lose connectivity). To delete a message, you must send a separate request which acknowledges that you've successfully received and processed the message.
This has not changed and likely won’t change in the future, as there is no way for SQS to definitively know in all cases whether messages have been processed successfully. If SQS started to “assume” what happens downstream, it would risk becoming unreliable in many scenarios.

Yes, otherwise the next time you ask for a set of messages, you will get the same messages back - maybe not on the next call, but eventually you will. You likely don't want to keep processing the same set of messages over and over.
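To make the "maybe not on the next call, but eventually" part concrete, here is a toy in-memory model of a queue with a visibility timeout. It is not the SQS API: the ToyQueue class, its method names, and the 30-second timeout are all assumptions made for illustration.

```javascript
// Toy in-memory model of a queue with a visibility timeout.
// This is NOT the SQS API; every name here is hypothetical.
class ToyQueue {
  constructor(visibilityTimeoutMs) {
    this.visibilityTimeoutMs = visibilityTimeoutMs;
    this.messages = new Map(); // id -> { body, invisibleUntil }
  }

  send(id, body) {
    this.messages.set(id, { body, invisibleUntil: 0 });
  }

  // Return the currently visible messages and hide them for the timeout.
  receive(now) {
    const out = [];
    for (const [id, message] of this.messages) {
      if (message.invisibleUntil <= now) {
        message.invisibleUntil = now + this.visibilityTimeoutMs;
        out.push({ id, body: message.body });
      }
    }
    return out;
  }

  // Acknowledge a message so it is never delivered again.
  remove(id) {
    this.messages.delete(id);
  }
}

const queue = new ToyQueue(30000);
queue.send("a", "first");
queue.send("b", "second");

const firstBatch = queue.receive(0);        // both messages are delivered
queue.remove("a");                          // only "a" is acknowledged
const duringTimeout = queue.receive(10000); // "b" is still hidden
const afterTimeout = queue.receive(30000);  // "b" is delivered again
```

The message that was never removed stays hidden for a while and then reappears, which is why a consumer that polls the queue itself has to delete what it has finished processing.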

Related

How does SQS keep track of messages?

I have a pretty standard setup of feeding SQS to Lambda. The lambda reads the message and makes a web request to a defined endpoint.
If I encounter an exception during processing of the SQS message that is due to the form of the message then I put the message on a dead letter queue.
If I encounter an error with the web request, I put the message back on the feeding queue to make the HTTP request at a later time.
This seems to work fine, but we just ran into an issue where an HTTP endpoint was down for 4 days and the feeding queue dropped the message. I imagine this has something to do with the retention period setting of the queue.
Questions
Is there a way to know, in the lambda, how many times a message has been replayed?
How did the feeder queue know that the message that was re-enqueued was the same as the one that was originally put on the queue?
I'm currently not explicitly deleting a message off the queue. Not doing so hasn't seemed to cause any issues, no re-processing of messages or anything. Should I be explicitly deleting them?
The normal process would be:
The AWS Lambda function is triggered, with the message(s) passed via the event parameter
If the Lambda function successfully processes the message(s), it should return normally, without raising an error, and the message is automatically removed from the queue
If the Lambda function is unable to process the message, it should raise an error; the message becomes visible on the queue again once its visibility timeout expires and is re-processed (unless it has exceeded the retry count). A returned status code such as 400 is not treated as a failure by the SQS trigger
If the Lambda function fails (eg due to a timeout), the message likewise becomes visible again and is re-processed (unless it has exceeded the retry count)
If a message has exceeded its retry count, Amazon SQS will move the message to the Dead Letter Queue
To answer your questions:
If you wish to take responsibility for these activities yourself, you can use the ApproximateReceiveCount attribute on the message. In the request, it appears that you should add AttributeNames=['ApproximateReceiveCount'], but the documentation is a bit contradictory. You might need to use All instead.
Since you are sending a new message to the queue, Amazon SQS is not aware that it is the same message. The message is not 're-enqueued' since it is a new message.
When your Lambda function completes without raising an error, the message is deleted off the queue for you.
You might consider using the standard functionality for retries and Dead Letter Queues rather than implementing that logic yourself.
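For the first question, a function triggered by SQS does not need to request the attribute: each record in the event already carries ApproximateReceiveCount under its attributes, as a string. A minimal sketch of reading it follows; MAX_ATTEMPTS and the helper names are hypothetical and chosen only for illustration.

```javascript
// Return the delivery count that Lambda includes with each SQS record.
// The attribute arrives as a string, so convert it to a number.
function receiveCount(record) {
  return Number.parseInt(record.attributes.ApproximateReceiveCount, 10);
}

// MAX_ATTEMPTS is a hypothetical threshold chosen for illustration.
const MAX_ATTEMPTS = 5;

// True once the message has been delivered MAX_ATTEMPTS times or more.
function isLastAttempt(record) {
  return receiveCount(record) >= MAX_ATTEMPTS;
}
```

Keep in mind that the count is tracked per message, so a copy that your code sends back to the queue as a new message starts again at 1.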

SQS Lambda Trigger with Visibility Timeout extension

I'm working on a solution where I have an SQS queue with a Lambda trigger. My understanding is that Lambda will receive messages in batches to be processed, and once the Lambda function is successful, the messages in the SQS queue are automatically deleted. However, how do I only allow some of those messages to be deleted?
Let's assume this use case:
The Lambda function receives a batch of 10 messages. Only 7 messages are valid and can be processed, and the other 3 messages need to be reprocessed at a later point.
My initial thought was that I could update the visibility timeout via the boto3 SQS change_message_visibility call for each of the 3 messages to have them reprocessed after the timeout. However, since the overall Lambda function execution is successful, all 10 messages are deleted from the SQS queue.
Any suggestions?
Yes, by default, the Lambda function deletes all the messages upon success. You would need to handle this in your code, but not by changing the visibility timeout of the messages.
Add a DLQ (dead-letter queue) that will actually handle the failed messages (messages go to the DLQ after a certain number of failed processing attempts, depending on how you set it up)
You have a few options here:
You can handle each item yourself, and delete messages that are processed successfully. In case of a message that's not successful, you can throw an error and it won't be deleted automatically by the lambda function
If you use JavaScript you can try with Middy
If you use Python, you can use Lambda Powertools Python
For AWS Lambdas with an SQS trigger, by default, when your function encounters an error processing one or more messages in a given batch, the entire batch is marked as a failure. All of the messages in the batch are made visible again in the queue. Depending on your redrive policy, you can end up repeatedly processing successful messages along with the failures.
Rather than change the visibility timeout, the simplest way to specify which messages should be retried later and which can safely be deleted from the queue is to change the function response type to ReportBatchItemFailures. This allows you to return a list of failed message ids, indicating that only those messages in the batch should be made visible again in the queue.
Here's what the reporting syntax looks like for a handler function in Node.js:
exports.handler = async (event) => {
    // Process the event
    const batchItemFailureResponse = {
        batchItemFailures: [
            {
                itemIdentifier: "idFailedMessage1"
            },
            {
                itemIdentifier: "idFailedMessage2"
            },
            {
                itemIdentifier: "idFailedMessage3"
            }
        ]
    };
    return batchItemFailureResponse;
};
There is more information to be found in the official documentation.
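In practice the list of failures would be built while processing rather than hard-coded. The sketch below shows one way to do that, assuming ReportBatchItemFailures is enabled on the trigger; the processMessage helper and the valid field are hypothetical stand-ins for real processing.

```javascript
// Hypothetical check standing in for real processing.
async function processMessage(record) {
  const payload = JSON.parse(record.body);
  if (payload.valid !== true) {
    throw new Error("message cannot be processed yet");
  }
}

// Collect the messageId of every record that fails, then report only those.
// Records that are not listed are treated as successful and deleted.
const handler = async (event) => {
  const batchItemFailures = [];
  for (const record of event.Records) {
    try {
      await processMessage(record);
    } catch (err) {
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
};

exports.handler = handler;
```

An empty batchItemFailures list reports the whole batch as successful, so the handler never needs to throw for an individual bad message.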
This response type is configured when setting a queue as an event source for the Lambda. If you're configuring from the console, navigate to the Lambda function page, select the Configuration tab, and then choose Triggers. Then choose Add Trigger and choose the SQS trigger type. In addition to providing the standard parameters, be sure to check the box under Report batch item failures after expanding Additional Settings. It should look something like this:
(Screenshot: Add trigger with batch failure reporting)
This parameter must be set when first creating the trigger.
This response type can also be defined if you use CloudFormation templates to provision your resources. See the AWS documentation for more information. Note that if you use AWS SAM event source mappings, the documentation suggests that adding FunctionResponseTypes to the YAML with ReportBatchItemFailures in the type list isn't supported. That is incorrect; the documentation is simply outdated. There is an open issue around addressing this oversight.
Finally, in addition to reporting batch item failures, you should provision a target DLQ (dead-letter queue) and determine a reasonable maximum receive count so that action can be taken on messages that fail repeatedly.

SNS > AWS Lambda asynchronous invocation queue vs. SNS > SQS > Lambda

Background
This architecture relies solely on Lambda's asynchronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data that can vary drastically in size (tens of KB to potentially 1-3MB). The data contains a JSON array containing one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subscribed to the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS managed Lambda asynchronous queue begins to fill up as the instances of the parser max out at 3. The Lambda queueing mechanism initiates retries at increasing backoff intervals when throttling occurs, until the invocation request can be processed by the parser function.
It is imperative that a record does not get lost during this process as they cannot be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocations. Everything worked as expected. Lambda reported hundreds of throttle responses but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaviour is effectively the same as that which I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in AWS managed Lambda asynchronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prolonged periods so I want to understand whether the Lambda queue can handle this as sensibly as an SQS service. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a definitive answer but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/#zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail. I can't find it at the moment. The premise is the same as this: queues are leaky buckets.
So what if I add more leaky buckets? Well, you've delayed the leaking, however it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
An MVCE is missing from this question so I cannot address the precise problem you are having.
As for an opinion on whether to choose SQS or the Lambda queue, I'll point to the Meta on this
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR;
It depends
I think the biggest advantage of using your own queue is the fact that you as a user have visibility into the state of your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions with the 'guarantee' that Lambda will retry over an interval. If using an SQS source queue instead, you have complete visibility into the state of your message processing at all times with no ambiguity.
Secondly, regarding overpulling: in theory this is a concern, but in practice it's never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously set your retry policy appropriately and use a DLQ as transient/unpredictable errors CAN occur.

AWS SQS Dead Letter Queue notifications

I'm trying to design a small message processing system based on SQS, Lambda, and SNS. In case of failure, I'd like for the message to be enqueued in a Dead Letter Queue (DLQ) and for a webhook to be called.
I'd like to know what the most canonical or reasonable way of achieving that would look like.
Currently, if everything goes well, the process should be as follows:
SQS (in place to handle retries) enqueues a message
Lambda gets invoked by SQS and processes the message
Lambda sends a webhook and finishes normally
If something in the lambda goes wrong (success webhook cannot be called, task at hand cannot be processed), the easiest way to achieve what I want seems to be to set up a DLQ1 that SQS would put the failed messages in. An auxiliary lambda would then be called to process this message, pass it to SNS, which would call the failure webhook, and also forward the message to DLQ2, the final/true DLQ.
Is that the best approach?
One alternative I know of is Alarms, though I've been warned that they are quite tricky. Another one would be to have lambda call the error reporting webhook if there's a failure on the last retry, although that somehow seems inappropriate.
Thanks!
Your architecture looks good enough in case of success, but I personally find it quite confusing if anything goes wrong as I don't see why you need two DLQs to begin with.
Here's what I would do in case of failure:
1. Define a DLQ on your source SQS Queue and set the maxReceiveCount to e.g. 3, meaning if messages fail three times, they will be redirected to the configured DLQ
2. Create a Lambda that listens to this DLQ.
3. Execute the webhook inside this Lambda.
4. Since step 3 automatically deletes the message from the Queue once it has been processed and, apparently, you want the messages to be persisted somewhere, store the content of the message in a file on S3 and store the file metadata (bucket and key) in a table in DynamoDB, so you can always query for failed messages.
I don't see any role for SNS here unless you want multiple subscribers for a given message, but as I see this is not the case.
This way, you need to maintain only one DLQ and you can get rid of SNS as it's only adding an extra layer of complexity to your architecture.
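The DLQ and maxReceiveCount from the first step are attached to the source queue through its RedrivePolicy attribute, which SQS expects as a JSON string. A minimal sketch of building that value follows; the buildRedrivePolicy helper is hypothetical and the ARN is a placeholder, not a real queue.

```javascript
// Build the RedrivePolicy attribute value for the source queue.
// SQS expects this attribute as a JSON string, not as an object.
function buildRedrivePolicy(deadLetterTargetArn, maxReceiveCount) {
  return JSON.stringify({ deadLetterTargetArn, maxReceiveCount });
}

// The ARN below is a placeholder for illustration only.
const policy = buildRedrivePolicy(
  "arn:aws:sqs:us-east-1:123456789012:example-dlq",
  3
);
```

The resulting string would be supplied as the RedrivePolicy attribute when creating or updating the source queue, whether through the console, an SDK call, or a template.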

What if my lambda job, which is subscribed to an AWS SNS topic, goes down or stops working?

I have one publisher and one subscriber for my SNS topic in AWS.
Suppose my subscriber fails and exits with an error.
Will SNS repush those failed messages?
If not...
Is there another way to achieve that goal where my system starts processing from the last successful lambda execution?
There is a retry policy, but if your application already received the message, then no. If something goes wrong you won't see it again, and since Lambdas don't carry state, you could be in trouble.
I might consider looking at SQS instead of SNS. Remember, messages in SQS won't be removed until you remove them and you can set a window of invisibility. Therefore, you can easily ensure the next Lambda execution picks up where things left off (depending on your settings). Each Lambda would then be responsible for removing that message from SQS and that's how you'd know the message was processed.
Without knowing more about your application and needs, I couldn't say for sure...But I would take a look at it. I've built a "taskmaster" Lambda before that ran on a schedule and read from an SQS queue (multiple queues actually - the scheduled job passed different JSON event based on which queue to read from). It would then pass the job off to the appropriate Lambda "worker" which would then remove that message. Should it stop working...Well, the invisibility period would timeout (and 5 minutes isn't bad here given that's all Lambdas can execute for) and the next Lambda would pick it up. The taskmaster then would run as often as needed and read as many jobs from the queue as necessary. This really helps you have complete control over at what rate you are processing things, how many times you are retrying things, etc. Then you can also make use of a dead-letter queue to catch anything that may have failed (also, think about sticking things back into the queue).
You have a LOT of flexibility with SQS that I'm not really sure you get with SNS to be honest. I was never fond of SNS, though it too has a place and time and so again without knowing more here I couldn't say if SQS would be the fit for you...But I think your concerns can be taken care of with SQS if it makes sense for your application.