I'm reading the AWS documentation on dead-letter queues and the redrive policy, and it says: "The redrive policy specifies the source queue, the dead-letter queue, and the conditions under which Amazon SQS moves messages from the former to the latter if the consumer of the source queue fails to process a message a specified number of times".
However, even though the document mentions "message process failed" several times, I do not understand how SQS detects a message processing failure (and thus triggers a redrive, i.e. moves the message to the dead-letter queue).
From what I understand, consumer applications call receiveMessage to retrieve a message from SQS and then process it. The processing function is not passed into receiveMessage as a lambda, so how does SQS know that message processing has failed?
When a client (e.g. a Lambda function) receives a message from the queue, it has a limited time to call DeleteMessage. Each message also has a visibility timeout. If the message is not deleted by the client within the visibility timeout, SQS "assumes" that the processing failed.
Such messages can then be moved to the dead-letter queue, depending on how many failed attempts you configure the queue to tolerate.
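That tolerance is the maxReceiveCount in the redrive policy, which is configured on the source queue. A minimal boto3 sketch, assuming a hypothetical source queue URL and DLQ ARN:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical source queue URL and DLQ ARN for illustration.
source_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-source-queue"
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:my-dead-letter-queue"

# The redrive policy is set on the *source* queue. Once a message has been
# received maxReceiveCount times without being deleted, SQS moves it to the DLQ.
sqs.set_queue_attributes(
    QueueUrl=source_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "5",
        })
    },
)
```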
Related
Let's say an SQS message was consumed by a consumer, and within the visibility timeout the consumer gave an error. SQS will now retry the message: will that happen after the message's current visibility timeout completes, or will it be done ASAP?
When a message is retrieved from an Amazon SQS queue using ReceiveMessage(), the message(s) will be marked as invisible.
When the worker finishes processing the message, it should call DeleteMessage(), passing the ReceiptHandle of the message(s).
If the SQS queue does not receive a DeleteMessage() request within the timeout period, then the message(s) will reappear on the queue for processing.
Amazon SQS does not know when a "consumer gave an error". All it knows is whether an API call was made to SQS to delete the message, or to ask for the invisibility period to be extended. (This is slightly different if SQS is being accessed by AWS Lambda, but I will assume this is not the case in your situation.)
SQS will retry the message after the current visibility timeout has elapsed, meaning that if the consuming Lambda runs for x seconds before the error occurs, the retry will happen after currentVisibilityTimeout - x seconds.
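If you don't want to wait that long, the consumer can explicitly shorten the remaining visibility when it detects a failure. A rough consumer-loop sketch in boto3, where process() is a hypothetical stand-in for your logic and the queue URL is illustrative:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

def process(body):
    """Hypothetical processing logic; raises on failure."""
    ...

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])
        except Exception:
            # Failure: make the message visible again immediately instead of
            # waiting out the remaining visibility timeout.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=0,
            )
        else:
            # Success: deleting the message is the only "ack" SQS understands.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```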
I have a Lambda function with SQS as its trigger. When the Lambda executes, it either throws an error or it doesn't. If it throws, the job is put back in the queue, which creates a loop, and you know about the AWS bill for sure :)
Should I return something in the Lambda function to let SQS know that I got the message (done the job)? How should I ack the message? As far as I know we don't have ack and nack in SQS.
Is there any option in the SQS configuration to only retry N times if a job fails?
For standard use cases you do not have to actively manage success-failure communication between Lambda and SQS. If the Lambda returns without error within the timeout period, SQS will know the message was successfully processed. If the function returns an error, then SQS will retry a configurable number of times and finally direct still-failing messages to a Dead Letter Queue (if configured).
Docs: Amazon SQS supports dead-letter queues, which other queues (source queues) can target for messages that can't be processed (consumed) successfully.
Important: Add your DLQ to the SQS queue, not the Lambda. Lambda DLQs are a way to handle errors for async (event-driven) invocation.
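In code, the handler itself is the ack: returning normally acknowledges, and raising an exception is the closest thing to a nack. A minimal sketch, where do_work is a hypothetical stand-in for your logic:

```python
def do_work(body):
    """Hypothetical business logic; raises on failure."""
    ...

def handler(event, context):
    # Lambda's SQS event source mapping delivers a batch of records.
    for record in event["Records"]:
        do_work(record["body"])
    # Returning normally = success: the Lambda service deletes the batch.
    # Raising an exception = failure: the messages become visible again and,
    # after maxReceiveCount failed receives, SQS moves them to the DLQ.
```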
I have a task generator that generates task messages to an SQS queue and a bunch of workers that poll the SQS queue to process the tasks. In this case, is there any benefit to having the task generator publish messages to an SNS topic first, with the SQS queue subscribing to that topic? I assume directly publishing to the SQS queue is enough.
Assuming you don't need to fan out the messages to different types of workers, and your workers are all doing the same job, then no, you don't.
Each worker can take and process one message.
One item to be aware of is the timeout before a message becomes visible on SQS again: not configuring the timeouts correctly could cause another worker to process the same message.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
When a consumer receives and processes a message from a queue, the message remains in the queue. Amazon SQS doesn't automatically delete the message. Because Amazon SQS is a distributed system, there's no guarantee that the consumer actually receives the message (for example, due to a connectivity issue, or due to an issue in the consumer application). Thus, the consumer must delete the message from the queue after receiving and processing it.
Visibility Timeout
Immediately after a message is received, it remains in the queue. To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, a period of time during which Amazon SQS prevents other consumers from receiving and processing the message. The default visibility timeout for a message is 30 seconds. The minimum is 0 seconds. The maximum is 12 hours. For information about configuring visibility timeout for a queue using the console...
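If processing may outlast the visibility timeout, the consumer can extend it with ChangeMessageVisibility before it expires. A boto3 sketch, assuming a hypothetical queue URL and a hypothetical long_running_work step:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

def long_running_work(body):
    """Hypothetical slow processing step."""
    ...

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    # Extend the visibility timeout before it expires so no other consumer
    # receives the same message while we are still working on it.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=msg["ReceiptHandle"],
        VisibilityTimeout=300,  # give ourselves 5 more minutes
    )
    long_running_work(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```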
We currently have an SQS queue for processing incoming data. Is there a recommended way of managing two DLQs for one queue?
If there is a parsing error in the incoming data, I want to move the message directly into a "userInput" DLQ, without redrives.
If our Mongo is at maxConnections, or any other error occurs, the configured redrive policy should take effect.
Do I have to put the message into the DLQ manually for the first scenario, or is there a better way?
Thanks!
An Amazon SQS queue only has one Dead Letter Queue.
If a message is read from an SQS queue more than a defined number of times, the message can be moved to the Dead Letter Queue for later processing. However, there is no control over what conditions will send the message to the Dead Letter Queue. It is simply based on a message being retrieved more than the maxReceiveCount.
See: Amazon SQS dead-letter queues
Please note that SQS itself does not process the message. Rather, you will have an app or an AWS Lambda function that reads the message from the queue and processes the message. Therefore, you could program your desired functionality (checking incoming data, responding to Mongo maxConnections) into the code that is processing the message from SQS. If it detects such a problem, that program could send the message to a specific queue, and then delete the original message from the source SQS queue.
This would have the same behaviour as having "multiple DLQs", except that your code is responsible for the logic of moving the messages to these queues, rather than Amazon SQS doing it.
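A sketch of that consumer-side routing, assuming hypothetical queue URLs and hypothetical parse / store_in_mongo helpers (the ParseError type is also illustrative):

```python
import boto3

sqs = boto3.client("sqs")
source_url = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-data"      # hypothetical
user_input_dlq = "https://sqs.us-east-1.amazonaws.com/123456789012/userInput-dlq"  # hypothetical

class ParseError(Exception):
    """Hypothetical error raised for malformed input."""

def parse(body):
    """Hypothetical parser; raises ParseError for bad user input."""
    ...

def store_in_mongo(data):
    """Hypothetical store step; raises on e.g. maxConnections."""
    ...

resp = sqs.receive_message(QueueUrl=source_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    try:
        data = parse(msg["Body"])
    except ParseError:
        # Unparseable input will never succeed: move it straight to the
        # "userInput" queue and delete the original so it is never redriven.
        sqs.send_message(QueueUrl=user_input_dlq, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=source_url, ReceiptHandle=msg["ReceiptHandle"])
        continue
    try:
        store_in_mongo(data)
    except Exception:
        # Transient failure: don't delete. The message becomes visible again
        # and the redrive policy moves it to the real DLQ after maxReceiveCount.
        continue
    sqs.delete_message(QueueUrl=source_url, ReceiptHandle=msg["ReceiptHandle"])
```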
SQS supports only a single DLQ.
Alternatively, what you could do is let the consumer of the queue handle your first case: if there is a parsing error in the incoming data, the consumer moves the message to another queue.
The second case will be handled automatically by the redrive policy, and the message will be moved to the real DLQ after maxReceiveCount is exceeded.
You can have only one DLQ for a queue.
However, you could subscribe a Lambda function to that one DLQ.
The Lambda function could process the "bad" messages and distribute them to other queues, so you could have additional DLQs for which the function filters the messages.
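A sketch of such a filtering Lambda, assuming the hypothetical convention that failed messages carry an errorType message attribute and that the target queue URLs exist:

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical target queues keyed by an errorType message attribute that the
# original consumer is assumed to set before the message fails into the DLQ.
TARGETS = {
    "parse-error": "https://sqs.us-east-1.amazonaws.com/123456789012/dlq-user-input",
    "db-error": "https://sqs.us-east-1.amazonaws.com/123456789012/dlq-infrastructure",
}
FALLBACK = "https://sqs.us-east-1.amazonaws.com/123456789012/dlq-unknown"  # hypothetical

def handler(event, context):
    # Triggered by the single real DLQ; fans each failed message out to a
    # more specific queue based on its attribute.
    for record in event["Records"]:
        attrs = record.get("messageAttributes", {})
        error_type = attrs.get("errorType", {}).get("stringValue")
        sqs.send_message(
            QueueUrl=TARGETS.get(error_type, FALLBACK),
            MessageBody=record["body"],
        )
    # Returning normally lets the Lambda service delete the batch from the DLQ.
```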
My Lambda configuration is as below:
Lambda concurrency is set to 50
SQS trigger batch size is set to 1
Issue:
When my queue is flooded with 200+ messages, some of the SQS triggers are missed and those messages go into the in-flight state without ever invoking the Lambda. This adds latency equal to the Lambda timeout value, because I have to wait for a message to come out of flight before it can be reprocessed.
Any inputs will be highly appreciated.
SQS is integrated with Lambda through event source mappings.
Thanks to the mappings, the Lambda service long-polls the SQS queue and invokes your function on your behalf. What's more, it automatically removes the messages from the queue if your Lambda successfully processes them.
Since you want to process 200+ messages, and you set concurrency to 50 with a batch size of 1, you can process only 50 messages in parallel. The rest will be throttled. When this happens:
If your function is throttled, returns an error, or doesn't respond, the message becomes visible again. All messages in a failed batch return to the queue, so your function code must be able to process the same message multiple times without side effects.
To rectify the issue, the following two immediate actions can be considered (a configuration sketch follows below):
increase the concurrency of your function to 200 or more.
increase the batch size to 10. With a batch size of 10 and a concurrency of 50, you can process 500 (10 x 50) messages concurrently.
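Both knobs can be changed through the Lambda API. A boto3 sketch, with a hypothetical function name and event source mapping UUID:

```python
import boto3

lambda_client = boto3.client("lambda")

function_name = "my-sqs-consumer"                       # hypothetical
mapping_uuid = "00000000-0000-0000-0000-000000000000"   # from list_event_source_mappings

# Reserve enough concurrency so up to 200 invocations can run in parallel.
lambda_client.put_function_concurrency(
    FunctionName=function_name,
    ReservedConcurrentExecutions=200,
)

# Raise the SQS trigger's batch size so each invocation handles 10 messages.
lambda_client.update_event_source_mapping(
    UUID=mapping_uuid,
    BatchSize=10,
)
```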
Also, since you are heavily throttled, setting up a dead-letter queue can be useful. The DLQ helps capture problematic or missed messages from the queue, so that you can process or inspect them later:
If a message fails to be processed multiple times, Amazon SQS can send it to a dead-letter queue. When your function returns an error, Lambda leaves it in the queue. After the visibility timeout occurs, Lambda receives the message again. To send messages to a second queue after a number of receives, configure a dead-letter queue on your source queue.