What I'm trying to do is log the content of DLQ messages without acknowledging them, so that they can be re-queued to the original queue later.
I thought about subscribing two DLQs, one for logging and a second for holding the messages for re-queueing, but that's not possible.
You have to consume the SQS messages from the DLQ, as they have a limited retention period (4 days by default). If you do not act on them within this period, they will expire and you will lose them.
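If you only need more time before acting, the retention period can be raised up to the SQS maximum of 14 days. A minimal sketch, assuming a boto3 SQS client is passed in; the function name and the queue URL in the usage note are hypothetical:

```python
# Maximum retention SQS allows: 14 days, expressed in seconds.
MAX_RETENTION_SECONDS = 14 * 24 * 60 * 60


def extend_dlq_retention(sqs_client, dlq_url, seconds=MAX_RETENTION_SECONDS):
    """Set MessageRetentionPeriod on the DLQ (SQS accepts 60 to 1209600 seconds)."""
    if not 60 <= seconds <= MAX_RETENTION_SECONDS:
        raise ValueError("retention must be between 60 seconds and 14 days")
    # Attribute values are passed as strings in the SQS API.
    sqs_client.set_queue_attributes(
        QueueUrl=dlq_url,
        Attributes={"MessageRetentionPeriod": str(seconds)},
    )
    return seconds
```

With boto3 this would be called as extend_dlq_retention(boto3.client("sqs"), "<your DLQ URL>"). It only buys time; the messages still expire eventually.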
Normally, you would set up a Lambda function to process messages from the DLQ. You can then do what you wish with these messages, which can include:
re-broadcasting to the original queue
sending SNS notifications about failed messages
saving them into S3 or CloudWatch Logs for future analysis.
You can configure a processing Lambda (via an event source mapping) that is invoked when a message lands in the DLQ and logs the message contents to CloudWatch Logs.
More commonly, though, the processing Lambda writes each DLQ message to a file and stores it in S3 for logging and auditing purposes.
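A minimal sketch of such a processing Lambda, assuming the standard SQS event shape (Records with messageId, body and messageAttributes) and a boto3 S3 client passed in. The function name, bucket and key prefix are hypothetical:

```python
import json
import logging

logger = logging.getLogger(__name__)


def archive_dlq_records(event, s3_client, bucket, prefix="dlq/"):
    """Log each SQS record in a Lambda event and store it in S3.

    Returns the list of S3 keys written.
    """
    keys = []
    for record in event.get("Records", []):
        message_id = record["messageId"]
        logger.info("DLQ message %s: %s", message_id, record["body"])
        key = f"{prefix}{message_id}.json"
        # Keep body and attributes so the message can be re-sent later.
        payload = {
            "messageId": message_id,
            "body": record["body"],
            "messageAttributes": record.get("messageAttributes", {}),
        }
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=json.dumps(payload).encode("utf-8"),
        )
        keys.append(key)
    return keys
```

Note that once the invocation succeeds, Lambda removes the batch from the DLQ, so the S3 copy is what you would re-queue from later.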
Related
Let's say I have set up an AWS SNS topic with 3 subscribers. I'd like to know when all of the subscribers have received/processed a message, in order to mark that message as processed by all 3 and to generate some metrics.
Is there a way to do this?
You can log delivery status for SNS topics to CloudWatch, but only for certain types of messages (AWS has no reliable way of knowing if some messages were received or not, such as with SMS or email).
The types of messages you can log are:
HTTP
Lambda
SQS
Custom Application (must be configured to tell AWS that the message is received)
To set up logging in SNS:
In the SNS console, click "Edit Topic"
Expand "delivery status logging"
Then you can configure which protocols to log and the necessary permissions to do so.
Once you're logging to CloudWatch, you can draw metrics from there.
If you need to be notified when the subscribers have received the messages, you could set up a subscription filter in CloudWatch Logs to send the relevant log events to a Lambda function, in which you would implement custom logic to notify you appropriately.
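A rough sketch of that Lambda logic. CloudWatch Logs delivers subscription events as a base64-encoded, gzip-compressed JSON document under awslogs.data. The field names read from each SNS delivery status entry (status, notification.messageId, delivery.destination) are my assumption about the log format, so check them against your own log group. The function names are hypothetical:

```python
import base64
import gzip
import json
from collections import defaultdict


def decode_subscription_event(event):
    """Decode the base64 + gzip payload CloudWatch Logs sends to Lambda."""
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))


def successful_destinations(event):
    """Map SNS message ID -> set of destinations with status SUCCESS."""
    delivered = defaultdict(set)
    for log_event in decode_subscription_event(event)["logEvents"]:
        entry = json.loads(log_event["message"])
        if entry.get("status") == "SUCCESS":
            message_id = entry["notification"]["messageId"]
            delivered[message_id].add(entry["delivery"]["destination"])
    return dict(delivered)


def fully_delivered(event, expected_subscribers=3):
    """Return the message IDs delivered to at least `expected_subscribers` destinations."""
    return sorted(
        message_id
        for message_id, destinations in successful_destinations(event).items()
        if len(destinations) >= expected_subscribers
    )
```

A single invocation only sees one batch of log events, so in practice the per-message sets would have to be persisted somewhere (a DynamoDB table, for instance) and merged across invocations.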
I mean successful processing by the consumer
Usually your consumers would have to indicate this somehow. This is use-case specific, so it's difficult to speculate on exact solutions.
But just to give an example, a popular approach is the request-response messaging pattern. Here, your consumers would use an SQS queue to publish the outcome of the message processing. The producer(s) would poll that queue to get these messages and so learn which messages were processed correctly and which were not.
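To illustrate the producer side of that pattern, here is a small sketch that tracks replies until every expected consumer has reported success. The reply format (a JSON body with message_id, consumer and ok) and the class name are made up for the example; your consumers would define their own:

```python
import json


class OutcomeTracker:
    """Tracks which consumers reported success for each message."""

    def __init__(self, expected_consumers):
        self.expected = set(expected_consumers)
        self.succeeded = {}  # message_id -> set of consumer names
        self.failed = {}     # message_id -> set of consumer names

    def record(self, reply_body):
        """Record one reply; return True if the message is now fully processed."""
        reply = json.loads(reply_body)
        bucket = self.succeeded if reply["ok"] else self.failed
        bucket.setdefault(reply["message_id"], set()).add(reply["consumer"])
        return self.is_complete(reply["message_id"])

    def is_complete(self, message_id):
        """True once every expected consumer has reported success."""
        return self.expected <= self.succeeded.get(message_id, set())
```

The producer would call record() with the Body of each message it receives from the reply queue and emit its "processed by all 3" metric when that returns True.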
We currently have an SQS queue for processing incoming data. Is there a recommended way of managing two DLQs for one queue?
if there is a parsing error of the incoming data, then I want to move the message directly into a "userInput" DLQ, without redrives
if our mongo is on maxConnections, or any other error occurs, then the configured redrive policy should take place
Do I have to put the message into the DLQ manually for the first scenario, or is there a better way?
Thanks!
An Amazon SQS queue only has one Dead Letter Queue.
If a message is read from an SQS queue more than a defined number of times, the message can be moved to the Dead Letter Queue for later processing. However, there is no control over what conditions will send the message to the Dead Letter Queue. It is simply based on a message being retrieved more than the maxReceiveCount.
See: Amazon SQS dead-letter queues
Please note that SQS itself does not process the message. Rather, you will have an app or an AWS Lambda function that reads the message from the queue and processes the message. Therefore, you could program your desired functionality (checking incoming data, responding to Mongo maxConnections) into the code that is processing the message from SQS. If it detects such a problem, that program could send the message to a specific queue, and then delete the original message from the source SQS queue.
This would have the same behaviour as having "multiple DLQs", except that your code is responsible for the logic of moving the messages to these queues, rather than Amazon SQS doing it.
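A minimal sketch of that routing for a consumer that polls the queue itself, assuming a boto3 SQS client and messages shaped like those returned by receive_message (Body, ReceiptHandle). The function name, the queue URLs and the process callback are hypothetical:

```python
import json


def handle_message(message, sqs_client, source_url, user_input_dlq_url, process):
    """Process one message received from `source_url`.

    Returns "processed" or "user_input_dlq". Any other exception propagates,
    leaving the message undeleted so the queue's redrive policy applies.
    """
    try:
        payload = json.loads(message["Body"])
    except ValueError:
        # Unparseable input will never succeed, so skip the redrive cycle.
        sqs_client.send_message(
            QueueUrl=user_input_dlq_url, MessageBody=message["Body"]
        )
        sqs_client.delete_message(
            QueueUrl=source_url, ReceiptHandle=message["ReceiptHandle"]
        )
        return "user_input_dlq"

    process(payload)  # may raise, e.g. when MongoDB is out of connections
    sqs_client.delete_message(
        QueueUrl=source_url, ReceiptHandle=message["ReceiptHandle"]
    )
    return "processed"
```

Sending to the "userInput" queue before deleting from the source means a crash between the two calls produces a duplicate rather than a lost message.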
SQS supports only a single DLQ per queue.
Alternatively, you could let the consumer of the queue handle your first case. That is, "if there is a parsing error of the incoming data", let the consumer move the message to another queue.
The second case will then be handled automatically by the redrive policy, which moves the message to the real DLQ once maxReceiveCount is exceeded.
You can have only one DLQ for a queue.
However, you could subscribe a Lambda function to that one DLQ.
The Lambda function could process the "bad" messages and distribute them to other queues. That way you could have additional DLQs, with the function filtering the messages into them.
I have deployed an AWS Lambda function that is triggered when an SQS queue receives a message. The function makes a request to a REST API, and if the response is not OK, the SQS message needs to be processed again.
That's why I currently need to resend the message to the queue, but I would prefer to delete the SQS messages programmatically instead, although I can't find how to configure SQS for that. I have tried message retention, but it seems the trigger event causes the message to be deleted anyway.
Other possible options could be back up the message in S3 or persisting it in DynamoDB but I wonder if there's a better option.
Any insights on this question would be very helpful.
From AWS Lambda Retry Behavior - AWS Lambda:
If you configure an Amazon SQS queue as an event source, AWS Lambda will poll a batch of records in the queue and invoke your Lambda function. If the invocation fails or times out, every message in the batch will be returned to the queue, and each will be available for processing once the Visibility Timeout period expires. (Visibility timeouts are a period of time during which Amazon Simple Queue Service prevents other consumers from receiving and processing the message).
Once an invocation successfully processes a batch, each message in that batch will be removed from the queue. When a message is not successfully processed, it is either discarded or if you have configured an Amazon SQS Dead Letter Queue, the failure information will be directed there for you to analyze.
So, it seems (from reading this) that a simple option would be to set a high visibility timeout on the queue and then raise an error if the function cannot process the message. The message will remain invisible for the configured timeout period and then reappear on the queue for processing. If it exceeds the permitted number of retries, it will be deleted or moved to a Dead Letter Queue (if configured).
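As a sketch of the "raise an error" part: the handler below fails the invocation whenever the API call does not return a 2xx status. The call_api callback (taking the parsed message and returning an HTTP status code) and the exception class are placeholders for your own REST client code:

```python
import json


class ApiCallFailed(Exception):
    """Raised so Lambda treats the invocation as failed."""


def make_handler(call_api):
    """Build a Lambda handler around `call_api(payload) -> HTTP status code`."""

    def handler(event, context):
        for record in event["Records"]:
            status = call_api(json.loads(record["body"]))
            if not 200 <= status < 300:
                # Failing the invocation returns the whole batch to the queue.
                raise ApiCallFailed(
                    f"API returned {status} for {record['messageId']}"
                )
        return {"processed": len(event["Records"])}

    return handler
```

Because a failure returns every message in the batch, a batch size of 1 on the trigger avoids re-sending messages that had already gone through.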
There is a lambda-powertools library created and maintained by AWS Labs, and one of its features is batch processing.
The batch processing utility handles partial failures when processing
batches from Amazon SQS, Amazon Kinesis Data Streams, and Amazon
DynamoDB Streams.
Check out the documentation here. This is the Python version, but there are versions for other runtimes.
So after some research I found the following:
There were workaround options for selectively filtering the successfully processed messages out of a batch before AWS implemented this natively.
Please refer to approaches 1-3 demonstrated in here.
To use AWS's own implementation, use approach No. 4.
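Assuming the AWS implementation meant here is Lambda's partial batch response for SQS, a hand-rolled version without any library looks roughly like this. It relies on ReportBatchItemFailures being enabled in the function response types of the event source mapping; process_record is a hypothetical callback for your own logic:

```python
def handler_with_partial_failures(event, context, process_record):
    """Process each SQS record; report only the failed ones back to Lambda."""
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # Only this message becomes visible again; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Since Lambda calls the handler with just (event, context), you would bind the callback first, for example with functools.partial(handler_with_partial_failures, process_record=my_function).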
We are evaluating SNS for our messaging requirements to integrate multiple applications. We have a single producer that publishes messages to multiple topics on SNS. Each topic has 2-5 subscribers. In the event of subscriber failures (e.g. down for maintenance), I have a few questions on the recommended strategy of using an SQS queue per consumer:
Is it possible to configure SNS to push to SQS only in event of failure in delivering the message to a subscriber? Dumping all the messages in SQS queue creates a problem for the consumer to analyze all messages in the queue when it restarts.
In event of subscriber failure, it can read messages from SQS queue on restart but how would it know that it missed messages from SNS when it was overloaded?
Any suggestions on handling subscriber failures are welcome.
Thanks!
No, it is not possible to "configure SNS to push to SQS only in event of failure".
Rather than trying to recover a message after a failure, you can configure the Amazon SNS retry policies.
From Setting Amazon SNS Delivery Retry Policies for HTTP/HTTPS Endpoints:
You can use delivery policies to control not only the total number of retries, but also the time delay between each retry. You can specify up to 100 total retries distributed among four discrete phases. The maximum lifetime of a message in the system is one hour. This one hour limit cannot be extended by a delivery policy.
So, you don't need to worry as long as the destination is back online within an hour.
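For illustration, a delivery policy is a JSON document with a healthyRetryPolicy section. The helper below is only a sketch that builds one and enforces the 100-retry limit quoted above. The function name and default values are my own choices, and only the basic retry fields are included:

```python
import json


def build_delivery_policy(num_retries, min_delay=20, max_delay=20, backoff="linear"):
    """Return a DeliveryPolicy JSON string for an HTTP/S subscription."""
    if not 0 <= num_retries <= 100:
        raise ValueError("SNS allows at most 100 total retries")
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    policy = {
        "healthyRetryPolicy": {
            "numRetries": num_retries,
            "minDelayTarget": min_delay,   # seconds
            "maxDelayTarget": max_delay,   # seconds
            "backoffFunction": backoff,    # linear, arithmetic, geometric or exponential
        }
    }
    return json.dumps(policy)
```

With boto3 the result would be applied through sns.set_subscription_attributes(SubscriptionArn=..., AttributeName="DeliveryPolicy", AttributeValue=policy).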
If it is likely to be offline for more than an hour, you will need to find a way to store and "replay" the messages, possibly by inspecting CloudWatch Logs.
Or, here's another idea...
Push initially to SQS. Have an AWS Lambda function triggered by SQS. The Lambda function can do the 'push' that would normally be done by SNS. If it fails, then the standard SQS invisibility process will retry it later, eventually going to a Dead Letter Queue.
I have the following pipeline in place to move events:
Service -> SNS -> AWS Lambda -> Dynamo Db.
So, basically, the service publishes data to an SNS topic, to which an AWS Lambda function is subscribed. This Lambda function then pushes the data to DynamoDB. Now I am adding a DLQ to the Lambda function to store messages that failed processing.
Failures can be due to an error in the publisher application or in the consumer application. E.g. the publisher changed the format of the data being published, the Lambda function does not support it yet, and it raises an error.
I wanted to know what we normally do with such messages after they have been pushed to the DLQ.
Do we change the Lambda function and then try to push the data again? Is this step done manually, or do we create a job that periodically pushes the data from the DLQ to the Lambda function?
Or do we normally just put an alarm on the DLQ and then handle this manually?
Sometimes the issue can be a DynamoDB connection failure on the first attempt, which would succeed if we pushed the message again. Handling that manually would be a problem.
In addition to Lambda DLQs, you should consider adding SNS DLQs:
https://aws.amazon.com/blogs/compute/designing-durable-serverless-apps-with-dlqs-for-amazon-sns-amazon-sqs-aws-lambda/
I can comment here for SQS -> DLQ
You don't need to move the messages back, because doing so brings many other challenges, such as duplicate messages, recovery scenarios, lost messages, de-duplication checks, etc.
Here is the solution we implemented.
We usually use the DLQ for transient errors, not for permanent errors, so we took the approach below:
1. Read the messages from the DLQ as you would from a regular queue.
   Benefits:
   - Avoids duplicate message processing.
   - Gives better control over the DLQ; for example, I added a check so that the DLQ is processed only once the regular queue has been completely processed.
   - Lets the process scale up based on the number of messages in the DLQ.
2. Then run the same code that the regular queue uses. This is more reliable if the job is aborted or the process is terminated during processing (e.g. the instance is killed).
   Benefits:
   - Code reusability
   - Error handling
   - Recovery and message replay
3. Extend the message visibility timeout so that no other thread processes the messages.
   Benefit:
   - Avoids the same record being processed by multiple threads.
4. Delete the message only when processing succeeds or fails with a permanent error.
   Benefit:
   - Processing keeps being retried for as long as the error is transient.
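Roughly, the steps above could look like the sketch below. It is not our production code: the two exception classes, the function name and the process callback are illustrative, and a boto3 SQS client is assumed to be passed in:

```python
class TransientError(Exception):
    """Worth retrying later (e.g. a dependency is temporarily down)."""


class PermanentError(Exception):
    """Will never succeed; the message should be dropped."""


def drain_dlq_once(sqs_client, dlq_url, process, visibility_seconds=300):
    """Receive one batch from the DLQ and process it like a regular queue.

    Returns (deleted, kept) counts.
    """
    response = sqs_client.receive_message(
        QueueUrl=dlq_url, MaxNumberOfMessages=10, WaitTimeSeconds=0
    )
    deleted = kept = 0
    for message in response.get("Messages", []):
        handle = message["ReceiptHandle"]
        # Hide the message from other workers while we work on it.
        sqs_client.change_message_visibility(
            QueueUrl=dlq_url, ReceiptHandle=handle,
            VisibilityTimeout=visibility_seconds,
        )
        try:
            process(message["Body"])
        except TransientError:
            kept += 1  # leave it in the DLQ; it reappears after the timeout
            continue
        except PermanentError:
            pass  # fall through to delete
        sqs_client.delete_message(QueueUrl=dlq_url, ReceiptHandle=handle)
        deleted += 1
    return deleted, kept
```

Here process would be the same function the regular queue's consumer uses, which is where the code reuse comes from.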
AWS Lambda Dead Letter Queues direct events that cannot be processed to the Amazon SNS topic or Amazon SQS queue that you’ve configured for the Lambda function.
So how to handle the error with the given payload, whether with a service subscribed to the SNS topic or by reading messages from SQS, is up to the developer. To address the questions listed:
You can use another Lambda function subscribed to the SNS topic to process the message.
Yes, it is closer to setting up an alarm and handling it manually.
By default, a failed Lambda function invoked asynchronously is retried twice, and then the event is discarded unless a DLQ is set up. So if it's a DynamoDB connection problem, it will probably be resolved by one of the retries.
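For the alarm route with an SQS queue as the DLQ, a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric is one option. The helper below is just a sketch that builds the arguments for put_metric_alarm; the function name, alarm name, period and notification topic are assumptions to adapt:

```python
def dlq_alarm_params(queue_name, alarm_topic_arn, threshold=0):
    """Build put_metric_alarm arguments that fire when the DLQ is non-empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [alarm_topic_arn],
    }
```

With boto3 this would be used as boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params("my-dlq", "<SNS topic ARN>")).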