SQS SendMessageBatch Partial Failure - amazon-web-services

I want to send messages to SQS in batches, and from a look at the API it seems SendMessageBatch will work here.
However, one thing I am not able to fully understand is that it returns a list of BatchResultErrorEntry in case of partial failure.
The AWS docs don't mention whether these entries are meant to be retried or are just informational.
If they are meant to be retried, is there a risk of getting into an infinite loop if the same items fail again? In that case, should we use the Code on each entry to identify which of the failed items can safely be retried? Also, since the failed message itself is not returned, it looks like we would have to maintain a mapping from the batch entry Id to the message in order to resend it.
Or is the best practice here to log these failures and throw an exception telling the caller that these items failed?
Looking for suggestions.

Those are meant to be retried, using an exponential backoff strategy. The SenderFault flag on each BatchResultErrorEntry tells you whether the failure was caused by the request itself; sender faults (e.g. an invalid message body) will fail again identically, so only server-side failures are worth retrying.
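A minimal sketch of that retry loop, assuming boto3-style request and response dicts; `send` stands in for a wrapper around your actual `sqs.send_message_batch` call, and the backoff constants are illustrative:

```python
import time

def send_batch_with_retry(send, entries, max_attempts=5):
    """`send` wraps sqs.send_message_batch and returns its response dict.
    Retries server-side failures with exponential backoff and returns the
    entries that ultimately could not be sent (sender faults plus anything
    still failing after max_attempts) so the caller can log or raise."""
    permanent = []
    for attempt in range(max_attempts):
        resp = send(entries)
        failed = resp.get("Failed", [])
        # Map each failed batch Id back to its original entry, since the
        # API only returns the Id, not the message itself.
        by_id = {e["Id"]: e for e in entries}
        permanent += [by_id[f["Id"]] for f in failed if f.get("SenderFault")]
        entries = [by_id[f["Id"]] for f in failed if not f.get("SenderFault")]
        if not entries:
            break
        if attempt < max_attempts - 1:
            time.sleep(min(0.1 * 2 ** attempt, 5.0))  # capped exponential backoff
    return permanent + entries
```

Capping the attempts (rather than looping until `Failed` is empty) is what avoids the infinite-loop risk from the question; whatever the function returns is what you log or surface to the caller.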

Related

Is it possible to know how many times an SQS message has been read

I have a use case where I need to know how many times an SQS message has been read in my code.
For example, we read a message from SQS and, for some reason or exception, we can't process it. The same message then becomes available to read again after the visibility timeout.
This creates an endless loop. Is there a way to know how many times a particular SQS message has been read and returned to the queue?
I am aware this can be handled via a dead-letter queue, but since that requires more effort I am checking whether there is another option.
I don't want to retry the message if it fails more than x times; I want to delete it instead. Is that possible in SQS?
You can do this manually by looking at the ApproximateReceiveCount attribute of your messages; see this question on how to do so. You just need to implement the logic to read the count and decide whether to try processing the message or to delete it. Note, however, that the receive count is affected by more than just programmatically processing messages: viewing messages in the console will increment it too.
That being said, a DLQ is a premade solution for exactly this use case. It's not a lot of additional work: all you have to do is create another SQS queue, set it as the DLQ of your processing queue, and set the number of retries. The DLQ then handles all your redrive logic, and instead of messages being deleted after n failures, they're moved to the DLQ, where you can inspect them manually to understand why they're failing, set metric alarms on the queue, and, if you want, redrive the messages back into your processing queue. Or just ignore them until they age out of the queue based on its retention policy. The important thing is that the DLQ gives you the option of seeing which messages failed after the fact, while deleting them outright does not.
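Wiring up the DLQ comes down to one queue attribute. A minimal boto3 sketch, with the queue ARN/URL as placeholders:

```python
import json

def redrive_policy(dlq_arn, max_receives=3):
    """Build the RedrivePolicy attribute value for the source queue:
    after max_receives failed receives, SQS moves the message to the DLQ."""
    return json.dumps({"deadLetterTargetArn": dlq_arn,
                       "maxReceiveCount": str(max_receives)})

# Hypothetical usage against a real queue:
# sqs.set_queue_attributes(
#     QueueUrl=source_queue_url,
#     Attributes={"RedrivePolicy": redrive_policy(dlq_arn)})
```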
When calling ReceiveMessage(), you can specify a list of AttributeNames that you would like returned.
One of these attributes is ApproximateReceiveCount, which returns "the number of times a message has been received across all queues but not deleted".
It is an 'approximate' count due to the highly parallel nature of SQS -- it is possible that the count is slightly off if a message was processed around the same time as this request.
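Putting the two answers together, a sketch of the "delete after x failures" logic; the threshold and queue URL are illustrative, and the boto3 calls are shown only as hypothetical usage:

```python
MAX_RECEIVES = 3  # illustrative threshold

def should_process(message, max_receives=MAX_RECEIVES):
    """Check ApproximateReceiveCount on a message returned by
    receive_message(..., AttributeNames=["ApproximateReceiveCount"]).
    The count arrives as a string in the Attributes dict."""
    count = int(message["Attributes"]["ApproximateReceiveCount"])
    return count <= max_receives

# Hypothetical usage:
# resp = sqs.receive_message(QueueUrl=queue_url,
#                            AttributeNames=["ApproximateReceiveCount"])
# for msg in resp.get("Messages", []):
#     if should_process(msg):
#         handle(msg)  # your processing; delete on success
#     else:
#         # give up: delete instead of letting the message loop forever
#         sqs.delete_message(QueueUrl=queue_url,
#                            ReceiptHandle=msg["ReceiptHandle"])
```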

The payload from my subscription doesn't show up in Nifi flow

After I sent a message to my GCP subscription, it takes a minute or two (should be instant) to appear in my Nifi flow. At this point, I see a bunch of XML and my payload isn't there. Does anyone know what's possibly happening?
If your push messages are not acknowledged, it may slow down delivery of the rest significantly.
Your use case looks like the endpoint doesn't acknowledge delivery instantly (or the acknowledgement is late for some other reason). If a message is not acknowledged immediately, the system will retry delivering it (with some delay) and will keep trying until it is acknowledged.
Also look at the Message Flow Control documentation, which may also point you to a solution.
A similar topic was also discussed here on StackOverflow (which might help you).

How to handle Dead Letter Queues in Amazon SQS?

I am using event-driven architecture for one of my projects. Amazon Simple Queue Service supports handling failures.
If a message is not successfully handled, it never reaches the point where I delete it from the queue. If it's a one-off failure, this is handled gracefully. However, if it is an erroneous message, it makes its way into the DLQ.
My question is what should be happening with DLQs later on? There are thousands of those messages stuck in the DLQ. How are they supposed to be handled?
I would love to hear some real-life examples and engineering processes that are in place in some of the organizations.
"It depends!"
Messages would have been sent to the Dead Letter Queue because something didn't happen as expected. It might be due to a data problem, a timeout or a coding error.
You should:
Start examining the messages that went to the Dead Letter Queue
Try to re-process the messages to determine the underlying cause of the failure (but sometimes it is a random failure that you cannot reproduce)
Once a cause is found, update the system to handle that particular use-case, then move onto the next cause
Common causes can be database locks, network errors, programming errors and corrupt data.
It's probably a good idea to setup some sort of monitoring so that somebody investigates more quickly, rather than letting it accumulate to thousands of messages.
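One way to wire up that monitoring is a CloudWatch alarm on the DLQ's visible-message count; a sketch in which the queue name, period, and SNS topic are all illustrative:

```python
def dlq_alarm_kwargs(queue_name, sns_topic_arn):
    """Build put_metric_alarm kwargs that fire as soon as the DLQ
    holds any visible messages."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# Hypothetical usage:
# cloudwatch.put_metric_alarm(**dlq_alarm_kwargs("my-dlq", alerts_topic_arn))
```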
The messages moved to the DLQ are, as you said, considered erroneous.
If the messages are erroneous due to a bug in the code, you should redrive these DLQ messages to the source queue once you've fixed the bug, so that they get another chance to be reprocessed.
It is very unlikely that "temporarily" erroneous messages end up in the DLQ if you have already configured maxReceiveCount as 3 or more on your source queue; temporary problems are mostly absorbed by this retry configuration.
And ultimately, a DLQ is also an ordinary SQS queue, which retains messages for up to 14 days. Even if there are thousands of messages there, they will eventually be gone. At this point, there are two options:
Messages in the DLQ are "really" erroneous. Look at the metrics, messages, and logs to identify the root cause. If there is no bug to fix, it means you are keeping unneeded data in the DLQ, so there is nothing wrong with losing it after 14 days. If there is a bug, fix it and simply redrive the messages from the DLQ to the source queue.
You don't want to dig through the messages to identify the reason for failure, and you only want to persist the message data for historical reasons (god knows why). You can create a Lambda function to poll the messages and persist them in a desired target database.
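For the redrive itself, newer SDK versions expose a server-side StartMessageMoveTask API, but it can also be done by hand: receive from the DLQ, re-send to the source queue, then delete from the DLQ. A sketch, in which the helper only copies the message body (message attributes would need the same treatment) and the queue URLs are placeholders:

```python
def to_send_kwargs(message, source_queue_url):
    """Turn one message received from the DLQ into send_message kwargs
    for the source queue."""
    return {"QueueUrl": source_queue_url, "MessageBody": message["Body"]}

# Hypothetical redrive loop:
# while True:
#     resp = sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10)
#     msgs = resp.get("Messages", [])
#     if not msgs:
#         break
#     for m in msgs:
#         sqs.send_message(**to_send_kwargs(m, source_url))
#         sqs.delete_message(QueueUrl=dlq_url,
#                            ReceiptHandle=m["ReceiptHandle"])
```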

AWS SQS Dead Letter Queue notifications

I'm trying to design a small message processing system based on SQS, Lambda, and SNS. In case of failure, I'd like for the message to be enqueued in a Dead Letter Queue (DLQ) and for a webhook to be called.
I'd like to know what the most canonical or reasonable way of achieving that would look like.
Currently, if everything goes well, the process should be as follows:
SQS (in place to handle retries) enqueues a message
Lambda gets invoked by SQS and processes the message
Lambda sends a webhook and finishes normally
If something in the lambda goes wrong (success webhook cannot be called, task at hand cannot be processed), the easiest way to achieve what I want seems to be to set up a DLQ1 that SQS would put the failed messages in. An auxiliary lambda would then be called to process this message, pass it to SNS, which would call the failure webhook, and also forward the message to DLQ2, the final/true DLQ.
Is that the best approach?
One alternative I know of is Alarms, though I've been warned that they are quite tricky. Another one would be to have lambda call the error reporting webhook if there's a failure on the last retry, although that somehow seems inappropriate.
Thanks!
Your architecture looks good enough in case of success, but I personally find it quite confusing if anything goes wrong as I don't see why you need two DLQs to begin with.
Here's what I would do in case of failure:
Define a DLQ on your source SQS Queue and set the maxReceiveCount to e.g. 3, meaning if messages fail three times, they will be redirected to the configured DLQ
Create a Lambda that listens to this DLQ.
Execute the webhook inside this Lambda.
Since step 3 automatically deletes the message from the Queue once it has been processed and, apparently, you want the messages to be persisted somewhere, store the content of the message in a file on S3 and store the file metadata (bucket and key) in a table in DynamoDB, so you can always query for failed messages.
I don't see any role for SNS here unless you want multiple subscribers for a given message, but as far as I can see, that is not the case.
This way, you need to maintain only one DLQ, and you can get rid of SNS, which is only adding an extra layer of complexity to your architecture.
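The DLQ-listener Lambda from the steps above might look roughly like this; the bucket, table, and webhook names are made up, and the `s3`, `dynamodb`, and `requests` objects in the hypothetical handler are assumed to exist:

```python
def failure_record(record, bucket="failed-messages"):
    """Derive the S3 key and DynamoDB item for one record of a Lambda
    SQS event, so failed messages stay queryable after deletion."""
    key = record["messageId"] + ".json"
    item = {"messageId": {"S": record["messageId"]},
            "s3Bucket": {"S": bucket},
            "s3Key": {"S": key}}
    return key, item

# Hypothetical handler wired to the DLQ:
# def handler(event, context):
#     for record in event["Records"]:
#         key, item = failure_record(record)
#         s3.put_object(Bucket="failed-messages", Key=key,
#                       Body=record["body"].encode())
#         dynamodb.put_item(TableName="failed-messages", Item=item)
#         requests.post(WEBHOOK_URL,
#                       json={"failedMessageId": record["messageId"]})
```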

What if my lambda job, which is subscribed to an AWS SNS topic, goes down or stops working?

I have one publisher and one subscriber for my SNS topic in AWS.
Suppose my subscriber fails and exits with an error.
Will SNS repush those failed messages?
If not...
Is there another way to achieve that goal where my system starts processing from the last successful lambda execution?
There is a retry policy, but if your application has already received the message, then no. If something goes wrong, you won't see it again, and since Lambdas don't carry state... you could be in trouble.
I might consider looking at SQS instead of SNS. Remember, messages in SQS won't be removed until you remove them and you can set a window of invisibility. Therefore, you can easily ensure the next Lambda execution picks up where things left off (depending on your settings). Each Lambda would then be responsible for removing that message from SQS and that's how you'd know the message was processed.
Without knowing more about your application and needs, I couldn't say for sure, but I would take a look at it. I've built a "taskmaster" Lambda before that ran on a schedule and read from an SQS queue (multiple queues, actually - the scheduled job passed a different JSON event based on which queue to read from). It would then pass the job off to the appropriate Lambda "worker", which would then remove that message. Should it stop working, well, the invisibility period would time out (and 5 minutes isn't bad here, given that's all Lambdas can execute for) and the next Lambda would pick it up. The taskmaster would then run as often as needed and read as many jobs from the queue as necessary. This really helps you have complete control over the rate at which you are processing things, how many times you are retrying things, etc. You can also make use of a dead-letter queue to catch anything that may have failed (also, think about sticking things back into the queue).
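The receive/process/delete cycle described above can be sketched as follows; `handle` is your worker logic, and the visibility timeout value is illustrative:

```python
def process_one(sqs, queue_url, handle):
    """Receive one message, process it with `handle`, and delete it only
    after success, so a failed run reappears after the visibility timeout.
    Returns True if a message was processed."""
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               VisibilityTimeout=300)  # match worker runtime
    for msg in resp.get("Messages", []):
        handle(msg["Body"])  # if this raises, the message is NOT deleted
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
        return True
    return False
```

Deleting only after `handle` succeeds is the key point: a crash leaves the message invisible until the timeout expires, after which the next worker picks it up.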
You have a LOT of flexibility with SQS that I'm not really sure you get with SNS to be honest. I was never fond of SNS, though it too has a place and time and so again without knowing more here I couldn't say if SQS would be the fit for you...But I think your concerns can be taken care of with SQS if it makes sense for your application.