How to handle Dead Letter Queues in Amazon SQS? - amazon-web-services

I am using event-driven architecture for one of my projects. Amazon Simple Queue Service supports handling failures.
If a message was not successfully handled, it does not get to the part where I delete the message from the queue. If it's a one-time failure, it is handled gracefully. However, if it is an erroneous message, it makes its way into the DLQ.
My question is what should be happening with DLQs later on? There are thousands of those messages stuck in the DLQ. How are they supposed to be handled?
I would love to hear some real-life examples and engineering processes that are in place in some of the organizations.

"It depends!"
Messages would have been sent to the Dead Letter Queue because something didn't happen as expected. It might be due to a data problem, a timeout or a coding error.
You should:
Start examining the messages that went to the Dead Letter Queue
Try and re-process the messages to determine the underlying cause of the failure (but sometimes it is a random failure that you cannot reproduce)
Once a cause is found, update the system to handle that particular use-case, then move onto the next cause
Common causes can be database locks, network errors, programming errors and corrupt data.
It's probably a good idea to set up some sort of monitoring so that somebody investigates more quickly, rather than letting it accumulate to thousands of messages.
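
As a rough illustration of the monitoring idea, the sketch below checks how many messages are sitting in the DLQ. It assumes `sqs` is a boto3 SQS client (for example `boto3.client("sqs")`); the function names and the threshold are hypothetical and not part of the original answer.

```python
def dlq_depth(sqs, dlq_url):
    """Return the approximate number of visible messages in the DLQ."""
    resp = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(resp["Attributes"]["ApproximateNumberOfMessages"])


def should_alert(sqs, dlq_url, threshold=0):
    """True when the DLQ holds more messages than the (hypothetical) threshold."""
    return dlq_depth(sqs, dlq_url) > threshold
```

In practice a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric is probably a more common way to get the same signal without running any code of your own.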

The messages moved to the DLQ are, as you said, considered erroneous.
If the messages are erroneous because of a bug in the code, you should redrive these DLQ messages to the source queue once you have fixed the bug, so that they get another chance to be processed.
It is very unlikely that "temporarily" erroneous messages are moved to the DLQ if you have configured maxReceiveCount as 3 or more for your source queue. Temporary problems are mostly absorbed by this retry configuration.
Ultimately, a DLQ is an ordinary SQS queue, which retains messages for at most 14 days. Even if there are thousands of messages there, they will eventually expire. At this point, there are two options:
The messages in the DLQ are "really" erroneous. Look at the metrics, messages and logs to identify the root cause. If there is no bug to fix, you are just keeping unneeded data in the DLQ, so there is nothing wrong with losing it after 14 days. If there is a bug, fix it and simply redrive the messages from the DLQ to the source queue.
You don't want to dig through the messages to find out why they failed, and you only want to persist the message data for historical reasons (god knows why). You can create a Lambda function to poll the messages and persist them in a target database of your choice.
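
A manual redrive can be sketched roughly as follows. This is a minimal sketch, assuming a standard (non-FIFO) queue and a boto3 SQS client passed in as `sqs`; the function name and `max_batches` are hypothetical, and message attributes are not copied.

```python
def redrive(sqs, dlq_url, source_url, max_batches=100):
    """Move messages from the DLQ back to the source queue; return how many moved."""
    moved = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 per call
            WaitTimeSeconds=1,       # short wait so sparse queues still return messages
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # Send first, delete second: a crash in between leaves a duplicate
            # rather than losing the message.
            sqs.send_message(QueueUrl=source_url, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved
```

SQS also offers a built-in redrive (the "Start DLQ redrive" action in the console, and the StartMessageMoveTask API), which may make a hand-written loop like this unnecessary.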

Related

No bugs in codes, why messages are still being sent to Dead Letter Queue?

I have 2 Lambda functions that respectively send and receive workloads via an SQS queue. But many messages are unexpectedly sent to the DLQ. I am confident that it wasn't caused by bugs in the code, because not all messages went to the DLQ.
I set reservedConcurrentExecutions = 1 and maxReceiveCount = 1, does it matter? I'm thinking if I increase maxReceiveCount perhaps fewer messages will be sent to DLQ.
Anyhow, I hope somebody can walk me through the methodology behind that.
Increasing maxReceiveCount worked eventually. The official developer guide says:
If your function returns an error, or can't be invoked because it's at maximum concurrency, processing might succeed with additional attempts. To give messages a better chance to be processed before sending them to the dead-letter queue, set the maxReceiveCount on the source queue's redrive policy to at least 5.
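
For illustration, the redrive policy mentioned in the guide could be set along these lines. This is a sketch that assumes `sqs` is a boto3 SQS client; the helper names and the ARN used in the usage note are hypothetical.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count=5):
    """Build the JSON string SQS expects for the RedrivePolicy queue attribute."""
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )


def attach_dlq(sqs, source_url, dlq_arn, max_receive_count=5):
    """Point the source queue at the DLQ with the given receive limit."""
    sqs.set_queue_attributes(
        QueueUrl=source_url,
        Attributes={"RedrivePolicy": build_redrive_policy(dlq_arn, max_receive_count)},
    )
```

A call such as `attach_dlq(sqs, source_url, "arn:aws:sqs:us-east-1:123456789012:my-dlq")` would then let each message be received up to five times before it is moved to the DLQ.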

Is it possible to know how many times an SQS message has been read?

I have a use case where my code needs to know how many times an SQS message has been read.
For example, we read a message from SQS, and for some reason or exception we can't process it. The same message then becomes available in the queue again after the visibility timeout.
This creates an endless loop. Is there a way to know how many times a particular SQS message has been read and returned to the queue?
I am aware this can be handled via a dead letter queue. Since that requires more effort, I am checking whether there is any other option.
I don't want to retry the message if it fails more than x times; I want to delete it. Is that possible in SQS?
You can do this manually by looking at the ApproximateReceiveCount attribute of your messages, see this question on how to do so. You just need to implement the logic to read the count and decide whether to try processing the message or delete it. Note however that the receive count is affected by more than just programmatically processing messages: viewing messages in the console will increment it too.
That being said, a DLQ is a premade solution for exactly this use case. It's not a lot of additional work: all you have to do is create another SQS queue, set it as the DLQ of your processing queue, and set the number of retries. Then, the DLQ handles all your redrive logic, and instead of deleting messages after n failures they're moved to the DLQ, where you can manually look at them to understand why they're failing, set metrics alarms on the queue, and if you want manually re-drive the messages into your processing queue. Or just ignore them until they age out of the queue based on its retention policy. The important thing is that the DLQ gives you the option of being able to see which messages failed after the fact, while deleting them outright does not.
When calling ReceiveMessage(), you can specify a list of AttributeNames that you would like returned.
One of these attributes is ApproximateReceiveCount, which returns "the number of times a message has been received across all queues but not deleted".
It is an 'approximate' count due to the highly parallel nature of SQS -- it is possible that the count is slightly off if a message was processed around the same time as this request.
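
A rough sketch of the manual approach is shown below. It assumes `sqs` is a boto3 SQS client; `handler`, `max_receives` and the returned summary dictionary are hypothetical names introduced only for this example.

```python
def receive_with_retry_limit(sqs, queue_url, handler, max_receives=3):
    """Process one batch; delete messages that were already received too often."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateReceiveCount"],
        MaxNumberOfMessages=10,
    )
    outcome = {"processed": 0, "dropped": 0, "left_for_retry": 0}
    for msg in resp.get("Messages", []):
        # The attribute is a string and includes the current receive.
        receives = int(msg["Attributes"]["ApproximateReceiveCount"])
        if receives > max_receives:
            # Give up on this message: delete it without processing.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
            outcome["dropped"] += 1
            continue
        try:
            handler(msg["Body"])
        except Exception:
            # Not deleted, so it reappears after the visibility timeout.
            outcome["left_for_retry"] += 1
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        outcome["processed"] += 1
    return outcome
```

Because the count is approximate, a message may occasionally get one attempt more or fewer than `max_receives` suggests.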

Managing SQS Queue Manually vs Lambda Trigger

I'm not sure if this would be better served on ServerFault or Software Engineering, willing to move this post if appropriate.
We have somewhat recently started to move some of our data processing pipeline to use queues to manage individual bits of data, whereas previously we had timed lambdas that would pull all data since last change.
While making this change, we noticed that queues didn't work quite as we had anticipated. We thought Lambda would just pull items off the queue as the Lambdas had availability. Instead, it seems the AWS-managed Lambda trigger grabs a chunk of messages (up to ten) and throws it at the Lambda service. If Lambda doesn't have availability, the message gets throttled, then replayed after a backoff time, up to our configured replay "error" limit (five). After that, it's thrown into our dead letter queue.
We see a handful of messages per day end up in the dead letter queue as a result of throttling. We then throw these back into the main queue (we have a process to do so every handful of hours). However, we weren't 100% sure throttling was the reason for things being pushed over since nothing indicates why the messages are moved over. We just assumed as much because we weren't getting any error logs for those messages. We contacted Amazon support to ask about this, and they were able to actually confirm the messages were in fact "errored" as a result of throttling.
We asked further into their recommendations for this - this must be a common problem right? They first off suggested upping our replay limit, which seemed an obvious no go. Replays occur for any failure, so that would just hammer our lambdas with bad requests when they came through. Asked also if there's any way to differentiate the errors because we don't care for throttling, we'd happily let those retry a dozen times if needed - but no. The other suggestion they had was to manage the queue ourselves from our lambdas. Build our own code within our lambdas to pull messages and then delete them after processing. This seems really counter-intuitive, though - why would every AWS consumer build their own infrastructure?
So I guess my question is, is this what others are doing? Are you using the built-in Lambda triggers? Are you creating your own code for managing queue consumption? Do you see these sorts of throttling, or is there anything we could do differently? Is there any difference with other services to manage this?
Best practice is to handle errors in your code and manually delete messages that have succeeded. That allows you to handle poison messages without reprocessing the good messages again. Throttles shouldn't be ending up in a DLQ that often. This video from re:Invent 2020 has a good explanation of how this works: Scalable serverless event-driven architectures with SNS, SQS & Lambda. Start at about the 20-minute mark to get into SQS error handling.
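
One way to get the same effect with the built-in trigger, rather than deleting messages by hand, is Lambda's partial batch response. This is a different technique from the manual deletion described above, and it only takes effect if ReportBatchItemFailures is enabled on the event source mapping. The sketch below assumes that setting; `process` is a hypothetical stand-in for the real business logic.

```python
def process(body):
    """Hypothetical business logic; raises on a message it cannot handle."""
    if body == "poison":
        raise ValueError("cannot process message")


def handler(event, context):
    """Report only the failed records so the successful ones are not retried."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # Only these message IDs are returned to the queue for another attempt.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Records not listed in `batchItemFailures` are treated as successful and removed from the queue, so a single poison message no longer forces the whole batch to be reprocessed.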

AWS SQS FIFO - How to get more than 10 messages at a time?

Currently we want to pull down an entire FIFO queue, and process the contents, and if any issues, release messages back into the queue.
The problem is, that currently AWS only gives us 10 messages, and won't give us 10 more (which is the way you get bulk messages in SQS, multiple 10 max message requests) until we delete or release the first 10.
We need to get more than 10 though. Is this not possible? We understand we can set the group_id to a random string, and that allows processing more, but then the order isn't guaranteed, which defeats the purpose of FIFO.
I managed to reproduce your results -- I could retrieve 10 messages, but then running the same command again would not return another set of messages.
The relevant documentation seems to be:
While messages with a particular MessageGroupId are invisible, no more messages belonging to the same MessageGroupId are returned until the visibility timeout expires. You can still receive messages with another MessageGroupId as long as it is also visible.
I suspect (just a theory!) that this is to preserve the ordering of messages... If a client asked for a set of messages and they are still being processed, there is the chance that the messages might be returned to the queue. Therefore, no further messages are provided until the original messages are deleted or pass their visibility timeout.
This is only a behaviour of FIFO queues.
It seems that you will need to receive and delete all messages to be able to access them all. I would suggest:
Receive one (or more) message.
Process it. If everything worked, delete the message.
If there were problems, push the message to a new queue.
Once the queue is empty, you would need to read from the new queue and send them back to the original queue (which should preserve ordering).
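
The first three steps above might look roughly like this. It is only a sketch: it assumes `sqs` is a boto3 SQS client, that the holding queue is also a FIFO queue, and that reusing the original MessageId as the deduplication ID is acceptable. `drain_fifo`, `holding_url`, `process` and `max_batches` are hypothetical names.

```python
def drain_fifo(sqs, queue_url, holding_url, process, max_batches=1000):
    """Drain a FIFO queue, parking failed messages in a holding queue."""
    held = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            AttributeNames=["MessageGroupId"],
            WaitTimeSeconds=2,  # an empty response is then a better hint the queue is empty
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            try:
                process(msg["Body"])
            except Exception:
                # Park the failed message, keeping its original group.
                sqs.send_message(
                    QueueUrl=holding_url,
                    MessageBody=msg["Body"],
                    MessageGroupId=msg["Attributes"]["MessageGroupId"],
                    MessageDeduplicationId=msg["MessageId"],
                )
                held += 1
            # Delete in both cases so the next messages of the group become available.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return held
```

Sending the parked messages back to the original queue afterwards would follow the same receive, send, delete pattern in the opposite direction.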
If you frequently require more capabilities than Amazon SQS provides, you could consider using Amazon MQ – Managed message broker service for ActiveMQ. It has many more capabilities (but is accordingly less 'simple').
If you set another MessageGroupId, you can get another 10 messages, even if you don't release or delete the previous ones.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagegroupid-property.html

What if my lambda job, which is subscribed to an AWS SNS topic, goes down or stops working?

I have one publisher and one subscriber for my SNS topic in AWS.
Suppose my subscriber fails and exits with an error.
Will SNS repush those failed messages?
If not...
Is there another way to achieve that goal where my system starts processing from the last successful lambda execution?
There is a retry policy, but if your application already received the message, then no. If something goes wrong you won't see it again and since Lambdas don't carry state...You could be in trouble.
I might consider looking at SQS instead of SNS. Remember, messages in SQS won't be removed until you remove them and you can set a window of invisibility. Therefore, you can easily ensure the next Lambda execution picks up where things left off (depending on your settings). Each Lambda would then be responsible for removing that message from SQS and that's how you'd know the message was processed.
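
The receive, process, delete cycle described here could be sketched as follows, assuming `sqs` is a boto3 SQS client. `poll_once`, `work` and the 300-second default are hypothetical choices for this example.

```python
def poll_once(sqs, queue_url, work, visibility_timeout=300):
    """Receive one message, run `work` on it, and delete it only on success."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        VisibilityTimeout=visibility_timeout,  # window of invisibility, in seconds
        WaitTimeSeconds=10,                    # long polling
    )
    done = []
    for msg in resp.get("Messages", []):
        try:
            work(msg["Body"])
        except Exception:
            # Left in the queue; visible again once the timeout expires.
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        done.append(msg["MessageId"])
    return done
```

If the function crashes or `work` raises, nothing is deleted, so the next invocation receives the same message after the visibility timeout.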
Without knowing more about your application and needs, I couldn't say for sure... But I would take a look at it. I've built a "taskmaster" Lambda before that ran on a schedule and read from an SQS queue (multiple queues actually; the scheduled job passed a different JSON event based on which queue to read from). It would then pass the job off to the appropriate Lambda "worker", which would then remove that message. Should it stop working... Well, the invisibility period would time out (and 5 minutes isn't bad here, given that was the most a Lambda could execute for at the time; the limit has since been raised to 15 minutes) and the next Lambda would pick it up. The taskmaster then would run as often as needed and read as many jobs from the queue as necessary. This really helps you have complete control over the rate at which you are processing things, how many times you are retrying things, etc. Then you can also make use of a dead-letter queue to catch anything that may have failed (also, think about sticking things back into the queue).
You have a LOT of flexibility with SQS that I'm not really sure you get with SNS to be honest. I was never fond of SNS, though it too has a place and time and so again without knowing more here I couldn't say if SQS would be the fit for you...But I think your concerns can be taken care of with SQS if it makes sense for your application.