Do failed SQS messages have a visibility timeout? - amazon-web-services

Let's say that a SQS message was consumed by a consumer, and within the visibility timeout, the consumer gave an error. Now the SQS will try to retry the message, so will that be done after the current visibility timeout of the message completes, or will it be done ASAP?

When a message is retrieve from an Amazon SQS queue using ReceiveMessage(), the message(s) will be marked as invisible.
When the worker finishes processing the message, it should call DeleteMessage(), passing the ReceiptHandle of the message(s).
If the SQS queue does not receive a DeleteMessage() request within the timeout period, then the message(s) will reappear on the queue for processing.
Amazon SQS does not know when a "consumer gave an error". All it knows is whether an API call was made to SQS to delete the message, or to ask for the invisibility period to be extended. (This is slightly different if SQS is being accessed by AWS Lambda, but I will assume this is not the case in your situation.)

SQS will retry the message after the current visibility timeout has elapsed, meaning if the consuming lambda runs for x seconds before the error occurs, the retry will happen after currentVisbilityTimeout - x seconds.

Related

AWS SQS how does SQS know that processing of a message failed?

I'm reading aws documents on deadletter queue and re-drive policy, and the document mentioned "The redrive policy specifies the source queue, the dead-letter queue, and the conditions under which Amazon SQS moves messages from the former to the latter if the consumer of the source queue fails to process a message a specified number of times".
However, even the document mentioned "message process failed" several times, I do not understand how sqs detects a message processing failure (and thus triggers re-drive or move to the dead letter queue.)
From what I understand, consumer applications call receiveMessage to retrieve the message from SQS, then process the message. The processing function is not passed in to receiveMessage as a lambda. So how does SQS know that message processing has failed?
When a client (e.g. a lambda function) gets a message from the queue, it has limited time to call DeleteMessage. Each msg has also visibility timeout. If the msg is not deleted by the client within the visibility timeout, SQS "assumes" that the processing failed.
Such messages can be then forwarded to SQS depending on how many failed attempts you setup to tollerate.

AWS lambda with SQS trigger keeps retrying and putting job back in the queue

I have a lambda function with SQS as its trigger. when lambda executes, either it throws an error or not. it will put the job back in the queue and creates a loop and you know about the AWS bill for sure :)
should I return something in lambda function to let SQS know that I got the message(done the job)? how should I ack the message? as far as I know we don't have ack and nack in SQS.
Is there any option in the SQS configuration to only retry N time if any job fails?
For standard uses cases you do not have to actively manage success-failure communication between lambda and SQS. If the lambda returns without error within the timeout period, SQS will know the message was successfully processed. If the function returns an error, then SQS will retry a configurable number of times and finally direct still-failing messages to a Dead Letter Queue (if configured).
Docs: Amazon SQS supports dead-letter queues, which other queues (source queues) can target for messages that can't be processed (consumed) successfully.
Important: Add your DLQ to the SQS queue, not the Lambda. Lambda DLQs are a way to handle errors for async (event-driven) invocation.

Message -> sqs vs Message -> sns -> sqs

I have a task generator to generate task messages to SQS queue and a bunch of workers to poll the SQS queue to process the task. In this case, is there any benefit to let the task generator to publish messages to a SNS topic first, and then the SQS queue subscribes to the SNS topic? I assume directly publish to SQS queue is enough.
Assuming you're not needing to fan out the messages to different types of workers, and your workers are doing the same job then no you don't.
Each worker can take and process one message.
One item to be aware off is the timeouts before the messages become visable on SQS again. i.e. not configuring the timeouts correctly could cause another worker to process the same message.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
When a consumer receives and processes a message from a queue, the
message remains in the queue. Amazon SQS doesn't automatically delete
the message. Because Amazon SQS is a distributed system, there's no
guarantee that the consumer actually receives the message (for
example, due to a connectivity issue, or due to an issue in the
consumer application). Thus, the consumer must delete the message from
the queue after receiving and processing it. Visibility Timeout
Immediately after a message is received, it remains in the queue. To
prevent other consumers from processing the message again, Amazon SQS
sets a visibility timeout, a period of time during which Amazon SQS
prevents other consumers from receiving and processing the message.
The default visibility timeout for a message is 30 seconds. The
minimum is 0 seconds. The maximum is 12 hours. For information about
configuring visibility timeout for a queue using the console

AWS lambda missing few SQS event miss leading to message in flight

My Lambda configuration is as below
Lambda Concurrency is set to 50
And SQS trigger batch size is set to 1
Issue:
When my queue is flooded with 200+ messages, some of the sqs triggers are missed and the message from the queue goes to inflight state without even triggering the lambda. This is adding a latency in processing by the timeout value set for lambda as I need to wait for the message to come out of flight for it to be reprocessed.
Any inputs will be highly appreciated.
SQS is integrated with Lambda through event source mappings.
Thanks to the mappings, the Lambda service is long polling the SQS queue, and invoking your function on your behalf. What's more it automatically removes the messages from the queue if your Lambda successfully processes them.
Since you want to process 200+ messages, and you set concurrency to 50 with batch size of 1, it means that you can process only 50 messages in parallel. The rest will be throttled. When this happens:
If your function is throttled, returns an error, or doesn't respond, the message becomes visible again. All messages in a failed batch return to the queue, so your function code must be able to process the same message multiple times without side effects.
To rectify the issue, the following two immediate actions can be considered:
increase concurrency of your function to 200 or more.
increase batch size to 10. With the batch size and concurrency of 50, you can process 500 (10 x 50) messages concurrently.
Also since you are heavily throttled, setting up a dead-letter queue can be useful. The DLQ helps captures problematic or missed messages from the queue, so that you can process them later or inspect:
If a message fails to be processed multiple times, Amazon SQS can send it to a dead-letter queue. When your function returns an error, Lambda leaves it in the queue. After the visibility timeout occurs, Lambda receives the message again. To send messages to a second queue after a number of receives, configure a dead-letter queue on your source queue.

SQS Redrive Delay

I have a lambda function triggered by an SQS queue. When the lambda fails to process a message, it redrives the message to a dead letter queue. The dead letter queue is configured with a delivery delay of 5 minutes, but any messages are visible and processed immediately.
The delivery delay seems to be getting ignored. Is this supposed to happen? Is there a way to configure a "redrive delay" for an SQS?
Looks like this is working as intended per this post from 2016 - https://forums.aws.amazon.com/thread.jspa?messageID=702896&#702896