AWS SQS Dead Letter Queue notifications

AWS SQS Dead Letter Queue notifications - amazon-web-services

I'm trying to design a small message processing system based on SQS, Lambda, and SNS. In case of failure, I'd like for the message to be enqueued in a Dead Letter Queue (DLQ) and for a webhook to be called.
I'd like to know what the most canonical or reasonable way of achieving that would look like.
Currently, if everything goes well, the process should be as follows:
SQS (in place to handle retries) enqueues a message
Lambda gets invoked by SQS and processes the message
Lambda sends a webhook and finishes normally
If something in the lambda goes wrong (success webhook cannot be called, task at hand cannot be processed), the easiest way to achieve what I want seems to be to set up a DLQ1 that SQS would put the failed messages in. An auxiliary lambda would then be called to process this message, pass it to SNS, which would call the failure webhook, and also forward the message to DLQ2, the final/true DLQ.
Is that the best approach?
One alternative I know of is Alarms, though I've been warned that they are quite tricky. Another one would be to have lambda call the error reporting webhook if there's a failure on the last retry, although that somehow seems inappropriate.
Thanks!

Your architecture looks good enough in case of success, but I personally find it quite confusing if anything goes wrong as I don't see why you need two DLQs to begin with.
Here's what I would do in case of failure:
Define a DLQ on your source SQS Queue and set the maxReceiveCount to e.g. 3, meaning if messages fail three times, they will be redirected to the configured DLQ
Create a Lambda that listens to this DLQ.
Execute the webhook inside this Lambda.
Since step 3 automatically deletes the message from the Queue once it has been processed and, apparently, you want the messages to be persisted somewhere, store the content of the message in a file on S3 and store the file metadata (bucket and key) in a table in DynamoDB, so you can always query for failed messages.
I don't see any role for SNS here unless you want multiple subscribers for a given message, but as I see this is not the case.
This way, you need need to maintain only one DLQ and you can get rid of SNS as it's only adding an extra layer of complexity to your architecture.

Related

How does SQS keep track of messages?

I have a pretty standard setup of feeding SQS to Lambda. The lambda reads the message and makes a web request to a defined endpoint.
If I encounter an exception during processing of the SQS message that is due to the form of the message then I put the message on a dead letter queue.
If I encounter an error with the web request, I put the message back on the feeding queue to make the HTTP request at a later time.
This seems to work fine, but we just ran into an issue where an HTTP endpoint was down for 4 days and the feeding queue dropped the message. I imagine this has something to do with the retention period setting of the queue.
Questions
Is there a way to know, in the lambda, how many times a message has been replayed?
How did the feeder queue know that the message that was re-enqueued was the same as the one that was originally put on the queue?
I'm currently not explicitly deleting a message off the queue. Not having that, hasn't seemed to cause any issues, no re-processing of messages or anything. Should I be explicitly deleting them?

The normal process would be:
The AWS Lambda function is triggered, with the message(s) passed via the event parameter
If the Lambda function successfully processes the message(s), it should return a 'success' code (200) and the message is automatically removed from the queue
If the Lambda function is unable to process the message, it should return a 'failure' code (eg 400) and Amazon SQS will automatically attempt to re-process the message (unless it has exceeded the retry count)
If the Lambda function fails (eg due to a timeout), Amazon SQS will automatically attempt to re-process the message (unless it has exceeded the retry count)
If a message has exceeded its retry count, Amazon SQS will move the message to the Dead Letter Queue
To answer your questions:
If you wish to take responsibility for these activities yourself, you can use the ApproximateReceiveCount attribute on the message. In the request, it appears that you should add AttributeNames=['ApproximateReceiveCount'], but the documentation is a bit contradictory. You might need to use All instead.
Since you are sending a new message to the queue, Amazon SQS is not aware that it is the same message. The message is not 're-enqueued' since it is a new message.
When your Lambda function returns 'success' (200), the message is being deleted off the queue for you.
You might consider using the standard functionality for retries and Dead Letter Queues rather than implementing that logic yourself.

is it possible to know how many times sqs messsage has been read

I have a use case to know how many times sqs message has been read in my code.
For example we read message from SQS, for abc reason/exception we cant process that message . Now the same message available in queue to read after visibility timeout.
This will create endless loop. Is there a way to know how many times particular sqs message has been read and returned back to queue.
I am aware this can be handled via dead letter queue. Since that requires more effort I am checking is there any other option
i dont want to retry the message if it fails more than x time and i want to delete it. Is it possible in SQS

You can do this manually by looking at the approximateReceiveCount attribute of your messages, see this question on how to do so. You just need to implement the logic to read the count and decide whether to try processing the message or delete it. Note however that receiveCount is affected by more than just programmatically processing messages: viewing messages in the console will increment it too.
That being said a DLQ is a premade solution for exactly this usecase. It's not a lot of additional work: all you have to do is create another SQS queue, set it as the DLQ of your processing queue, and set the number of retries. Then, the DLQ handles all your redrive logic, and instead of deleting messages after n failures they're moved to the DLQ, where you can manually look at them to understand why they're failing, set metrics alarms on the queue, and if you want manually re-drive the messages into your processing queue. Or just ignore them until they age out of the queue based on its retention policy - the important thing is that the DLQ gives you the option of being able to see which messages failed after the fact, while deleting them outright does not.

When calling ReceiveMessage(), you can specify a list of AttributeNames that you would like returned.
One of these attributes is ApproximateReceiveCount, which returns "the number of times a message has been received across all queues but not deleted".
It is an 'approximate' count due to the highly parallel nature of SQS -- it is possible that the count is slightly off if a message was processed around the same time as this request.

Is it necessary for a Lambda to delete messages from an SQS queue after processing?

I'm looking at the the AWS SQS documentation here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/ReceiveMessage.html#receive-sqs-message
My understanding is that we need to delete the message using AmazonSQSClient.DeleteMessage() once we're done processing it, but is this necessary when we're working with an SQS triggered Lambda?
I'm testing with a Lambda function that's triggered by an SQSEvent, and unless I'm mistaken, it appears that if the Lambda function runs to completion without throwing any errors, the message does NOT return to the SQS queue. If this is true, the I would rather avoid making that unnecessary call to AmazonSQSClient.DeleteMessage().
Here is a similar question from 2019 with the top answer saying that the SDK does not delete messages automatically and that they need to be explicitly deleted within the code. I'm wondering if anything has changed since then.
Thoughts?

The key here is that you are using the AWS Lambda integration with SQS. In that instance AWS Lambda handles retrieving the messages from the queue (making them available via the event object), and automatically deletes the message from the queue for you if the Lambda function returns a success status. It will not delete the message from the queue if the Lambda function throws an error.
When using AWS Lambda integration with SQS you should not be using the AWS SDK to interact with the SQS queue at all.
Update:
Lambda now supports partial batch failure for SQS whereby the Lambda function can return a list of failed messages and only those will become visible again.

Amazon SQS doesn't automatically delete a message after retrieving it for you, in case you don't successfully receive the message (for example, if the consumers fail or you lose connectivity). To delete a message, you must send a separate request which acknowledges that you've successfully received and processed the message.
This has not changed and likely won’t change in the future as there us no way for SQS to definitively know in all cases if messages have successfully been processed. If SQS started to “assume” what happens downstream it risk becoming unreliable in many scenarios.

Yes, otherwise the next time you ask for a set of messages, you will get the same messages back - maybe not on the next call, but eventually you will. You likely don't want to keep processing the same set of messages over and over.

AWS SQS Boto3 sending messages to dead letter manually

So I am building a small application that uses SQS. I have a simple handler process that determines if a given message is considered processed, marked for retry (to be re-queued) or is not able to be processed (should be sent to dead letter).
However based on the docs it would appear the only way to truly send a message to DL is by using a redrive policy which operates over # of receives a message has racked up. Because of the nature of my application, I could have several valid retries if my process isn't ready to handle a given message, but there are also times I may want to DL a message I have just received. Does AWS/Boto3 not provide a way to mark a specific message for DL?
I know I can just send the message myself to another queue I consider my own DL, I would just rather use AWS' built in tools for this.

I don't believe there is any limitation that would prevent you from sending the message to the deal-letter-queue by yourself.
So just read the message from the Q, if you know it needs to go to the DLQ directly, send it to the DLQ and remove it from the regular Q.

What if my lambda job, which is subscribed to an AWS SNS topic, goes down or stops working?

I have one publisher and one subscriber for my SNS topic in AWS.
Suppose my subscriber is getting failed and exiting with a failure.
Will SNS repush those failed messages?
If not...
Is there another way to achieve that goal where my system starts processing from the last successful lambda execution?

There is a retry policy, but if your application already received the message, then no. If something goes wrong you won't see it again and since Lambdas don't carry state...You could be in trouble.
I might consider looking at SQS instead of SNS. Remember, messages in SQS won't be removed until you remove them and you can set a window of invisibility. Therefore, you can easily ensure the next Lambda execution picks up where things left off (depending on your settings). Each Lambda would then be responsible for removing that message from SQS and that's how you'd know the message was processed.
Without knowing more about your application and needs, I couldn't say for sure...But I would take a look at it. I've built a "taskmaster" Lambda before that ran on a schedule and read from an SQS queue (multiple queues actually - the scheduled job passed different JSON event based on which queue to read from). It would then pass the job off to the appropriate Lambda "worker" which would then remove that message. Should it stop working...Well, the invisibility period would timeout (and 5 minutes isn't bad here given that's all Lambdas can execute for) and the next Lambda would pick it up. The taskmaster then would run as often as needed and read as many jobs from the queue as necessary. This really helps you have complete control over at what rate you are processing things, how many times you are retrying things, etc. Then you can also make use of a dead-letter queue to catch anything that may have failed (also, think about sticking things back into the queue).
You have a LOT of flexibility with SQS that I'm not really sure you get with SNS to be honest. I was never fond of SNS, though it too has a place and time and so again without knowing more here I couldn't say if SQS would be the fit for you...But I think your concerns can be taken care of with SQS if it makes sense for your application.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js