Best way to move messages off DLQ in Amazon SQS? - amazon-web-services

What is the best practice to move messages from a dead letter queue back to the original queue in Amazon SQS?
Would it be
Get message from DLQ
Write message to queue
Delete message from DLQ
Or is there a simpler way?
Also, will AWS eventually have a tool in the console to move messages off the DLQ?

Here is a quick hack. This is definitely not the best or recommended option.
Set the main SQS queue as the DLQ for the actual DLQ with Maximum Receives as 1.
View the content in DLQ (This will move the messages to the main queue as this is the DLQ for the actual DLQ)
Remove the setting so that the main queue is no more the DLQ of the actual DLQ

On Dec 1 2021 AWS released the ability to redrive messages from a DLQ back to the source queue(or custom queue).
With dead-letter queue redrive to source queue, you can simplify and enhance your error-handling workflows for standard queues.
Source:
Introducing Amazon Simple Queue Service dead-letter queue redrive to source queues

There are a few scripts out there that do this for you:
npm / nodejs based: http://github.com/garryyao/replay-aws-dlq
# install
npm install replay-aws-dlq;
# use
npx replay-aws-dlq [source_queue_url] [dest_queue_url]
go based: https://github.com/mercury2269/sqsmover
# compile: https://github.com/mercury2269/sqsmover#compiling-from-source
# use
sqsmover -s [source_queue_url] -d [dest_queue_url]

Don't need to move the message because it will come with so many other challenges like duplicate messages, recovery scenarios, lost message, de-duplication check and etc.
Here is the solution which we implemented -
Usually, we use the DLQ for transient errors, not for permanent errors. So took below approach -
Read the message from DLQ like a regular queue
Benefits
To avoid duplicate message processing
Better control on DLQ- Like I put a check, to process only when the regular queue is completely processed.
Scale up the process based on the message on DLQ
Then follow the same code which regular queue is following.
More reliable in case of aborting the job or the process got terminated while processing (e.g. Instance killed or process terminated)
Benefits
Code reusability
Error handling
Recovery and message replay
Extend the message visibility so that no other thread process them.
Benefit
Avoid processing same record by multiple threads.
Delete the message only when either there is a permanent error or successful.
Benefit
Keep processing until we are getting a transient error.

I wrote a small python script to do this, by using boto3 lib:
conf = {
"sqs-access-key": "",
"sqs-secret-key": "",
"reader-sqs-queue": "",
"writer-sqs-queue": "",
"message-group-id": ""
}
import boto3
client = boto3.client(
'sqs',
aws_access_key_id = conf.get('sqs-access-key'),
aws_secret_access_key = conf.get('sqs-secret-key')
)
while True:
messages = client.receive_message(QueueUrl=conf['reader-sqs-queue'], MaxNumberOfMessages=10, WaitTimeSeconds=10)
if 'Messages' in messages:
for m in messages['Messages']:
print(m['Body'])
ret = client.send_message( QueueUrl=conf['writer-sqs-queue'], MessageBody=m['Body'], MessageGroupId=conf['message-group-id'])
print(ret)
client.delete_message(QueueUrl=conf['reader-sqs-queue'], ReceiptHandle=m['ReceiptHandle'])
else:
print('Queue is currently empty or messages are invisible')
break
you can get this script in this link
this script basically can move messages between any arbitrary queues. and it supports fifo queues as well as you can supply the message_group_id field.

That looks like your best option. There is a possibility that your process fails after step 2. In that case you'll end up copying the message twice, but you application should be handling re-delivery of messages (or not care) anyway.

here:
import boto3
import sys
import Queue
import threading
work_queue = Queue.Queue()
sqs = boto3.resource('sqs')
from_q_name = sys.argv[1]
to_q_name = sys.argv[2]
print("From: " + from_q_name + " To: " + to_q_name)
from_q = sqs.get_queue_by_name(QueueName=from_q_name)
to_q = sqs.get_queue_by_name(QueueName=to_q_name)
def process_queue():
while True:
messages = work_queue.get()
bodies = list()
for i in range(0, len(messages)):
bodies.append({'Id': str(i+1), 'MessageBody': messages[i].body})
to_q.send_messages(Entries=bodies)
for message in messages:
print("Coppied " + str(message.body))
message.delete()
for i in range(10):
t = threading.Thread(target=process_queue)
t.daemon = True
t.start()
while True:
messages = list()
for message in from_q.receive_messages(
MaxNumberOfMessages=10,
VisibilityTimeout=123,
WaitTimeSeconds=20):
messages.append(message)
work_queue.put(messages)
work_queue.join()

DLQ comes into play only when the original consumer fails to consume message successfully after various attempts. We do not want to delete the message since we believe we can still do something with it (maybe attempt to process again or log it or collect some stats) and we do not want to keep encountering this message again and again and stop the ability to process other messages behind this one.
DLQ is nothing but just another queue. Which means we would need to write a consumer for DLQ that would ideally run less frequently (compared to original queue) that would consume from DLQ and produce message back into the original queue and delete it from DLQ - if thats the intended behavior and we think original consumer would be now ready to process it again. It should be OK if this cycle continues for a while since we now also get an opportunity to manually inspect and make necessary changes and deploy another version of original consumer without losing the message (within the message retention period of course - which is 4 days by default).
Would be nice if AWS provides this capability out of the box but I don't see it yet - they're leaving this to the end user to use it in way they feel appropriate.

There is a another way to achieve this without writing single line of code.
Consider your actual queue name is SQS_Queue and the DLQ for it is SQS_DLQ.
Now follow these steps:
Set SQS_Queue as the dlq of SQS_DLQ. Since SQS_DLQ is already a dlq of SQS_Queue. Now, both are acting as the dlq of the other.
Set max receive count of your SQS_DLQ to 1.
Now read messages from SQS_DLQ console. Since message receive count is 1, it will send all the message to its own dlq which is your actual SQS_Queue queue.

We use the following script to redrive message from src queue to tgt queue:
filename: redrive.py
usage: python redrive.py -s {source queue name} -t {target queue name}
'''
This script is used to redrive message in (src) queue to (tgt) queue
The solution is to set the Target Queue as the Source Queue's Dead Letter Queue.
Also set Source Queue's redrive policy, Maximum Receives to 1.
Also set Source Queue's VisibilityTimeout to 5 seconds (a small period)
Then read data from the Source Queue.
Source Queue's Redrive Policy will copy the message to the Target Queue.
'''
import argparse
import json
import boto3
sqs = boto3.client('sqs')
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('-s', '--src', required=True,
help='Name of source SQS')
parser.add_argument('-t', '--tgt', required=True,
help='Name of targeted SQS')
args = parser.parse_args()
return args
def verify_queue(queue_name):
queue_url = sqs.get_queue_url(QueueName=queue_name)
return True if queue_url.get('QueueUrl') else False
def get_queue_attribute(queue_url):
queue_attributes = sqs.get_queue_attributes(
QueueUrl=queue_url,
AttributeNames=['All'])['Attributes']
print(queue_attributes)
return queue_attributes
def main():
args = parse_args()
for q in [args.src, args.tgt]:
if not verify_queue(q):
print(f"Cannot find {q} in AWS SQS")
src_queue_url = sqs.get_queue_url(QueueName=args.src)['QueueUrl']
target_queue_url = sqs.get_queue_url(QueueName=args.tgt)['QueueUrl']
target_queue_attributes = get_queue_attribute(target_queue_url)
# Set the Source Queue's Redrive policy
redrive_policy = {
'deadLetterTargetArn': target_queue_attributes['QueueArn'],
'maxReceiveCount': '1'
}
sqs.set_queue_attributes(
QueueUrl=src_queue_url,
Attributes={
'VisibilityTimeout': '5',
'RedrivePolicy': json.dumps(redrive_policy)
}
)
get_queue_attribute(src_queue_url)
# read all messages
num_received = 0
while True:
try:
resp = sqs.receive_message(
QueueUrl=src_queue_url,
MaxNumberOfMessages=10,
AttributeNames=['All'],
WaitTimeSeconds=5)
num_message = len(resp.get('Messages', []))
if not num_message:
break
num_received += num_message
except Exception:
break
print(f"Redrive {num_received} messages")
# Reset the Source Queue's Redrive policy
sqs.set_queue_attributes(
QueueUrl=src_queue_url,
Attributes={
'VisibilityTimeout': '30',
'RedrivePolicy': ''
}
)
get_queue_attribute(src_queue_url)
if __name__ == "__main__":
main()

AWS Lambda solution worked well for us -
Detailed instructions:
https://serverlessrepo.aws.amazon.com/applications/arn:aws:serverlessrepo:us-east-1:303769779339:applications~aws-sqs-dlq-redriver
Github: https://github.com/honglu/aws-sqs-dlq-redriver.
Deployed with a click and another click to start the redrive!

Here is also the script (written in Typescript) to move the messages from one AWS queue to another one. Maybe it will be useful for someone.
import {
SQSClient,
ReceiveMessageCommand,
DeleteMessageBatchCommand,
SendMessageBatchCommand,
} from '#aws-sdk/client-sqs'
const AWS_REGION = 'eu-west-1'
const AWS_ACCOUNT = '12345678901'
const DLQ = `https://sqs.${AWS_REGION}.amazonaws.com/${AWS_ACCOUNT}/dead-letter-queue`
const QUEUE = `https://sqs.${AWS_REGION}.amazonaws.com/${AWS_ACCOUNT}/queue`
const loadMessagesFromDLQ = async () => {
const client = new SQSClient({region: AWS_REGION})
const command = new ReceiveMessageCommand({
QueueUrl: DLQ,
MaxNumberOfMessages: 10,
VisibilityTimeout: 60,
})
const response = await client.send(command)
console.log('---------LOAD MESSAGES----------')
console.log(`Loaded: ${response.Messages?.length}`)
console.log(JSON.stringify(response, null, 4))
return response
}
const sendMessagesToQueue = async (entries: Array<{Id: string, MessageBody: string}>) => {
const client = new SQSClient({region: AWS_REGION})
const command = new SendMessageBatchCommand({
QueueUrl: QUEUE,
Entries: entries.map(entry => ({...entry, DelaySeconds: 10})),
// [
// {
// Id: '',
// MessageBody: '',
// DelaySeconds: 10
// }
// ]
})
const response = await client.send(command)
console.log('---------SEND MESSAGES----------')
console.log(`Send: Successful - ${response.Successful?.length}, Failed: ${response.Failed?.length}`)
console.log(JSON.stringify(response, null, 4))
}
const deleteMessagesFromQueue = async (entries: Array<{Id: string, ReceiptHandle: string}>) => {
const client = new SQSClient({region: AWS_REGION})
const command = new DeleteMessageBatchCommand({
QueueUrl: DLQ,
Entries: entries,
// [
// {
// "Id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
// "ReceiptHandle": "someReceiptHandle"
// }
// ]
})
const response = await client.send(command)
console.log('---------DELETE MESSAGES----------')
console.log(`Delete: Successful - ${response.Successful?.length}, Failed: ${response.Failed?.length}`)
console.log(JSON.stringify(response, null, 4))
}
const run = async () => {
const dlqMessageList = await loadMessagesFromDLQ()
if (!dlqMessageList || !dlqMessageList.Messages) {
console.log('There is no messages in DLQ')
return
}
const sendMsgList: any = dlqMessageList.Messages.map(msg => ({ Id: msg.MessageId, MessageBody: msg.Body}))
const deleteMsgList: any = dlqMessageList.Messages.map(msg => ({ Id: msg.MessageId, ReceiptHandle: msg.ReceiptHandle}))
await sendMessagesToQueue(sendMsgList)
await deleteMessagesFromQueue(deleteMsgList)
}
run()
P.S. The script is with room for improvement, but anyway might be useful.

here is a simple python script you can use from the cli to do the same, depending only on boto3
usage
python redrive_messages __from_queue_name__ __to_queue_name__
code
import sys
import boto3
from src.utils.get_config.get_config import get_config
from src.utils.get_logger import get_logger
sqs = boto3.resource('sqs')
config = get_config()
log = get_logger()
def redrive_messages(from_queue_name:str, to_queue_name:str):
# initialize the queues
from_queue = sqs.get_queue_by_name(QueueName=from_queue_name)
to_queue = sqs.get_queue_by_name(QueueName=to_queue_name)
# begin querying for messages
should_check_for_more = True
messages_processed = []
while (should_check_for_more):
# grab the next message
messages = from_queue.receive_messages(MaxNumberOfMessages=1);
if (len(messages) == 0):
should_check_for_more = False;
break;
message = messages[0]
# requeue it
to_queue.send_message(MessageBody=message.body, DelaySeconds=0)
# let the queue know that the message was processed successfully
messages_processed.append(message)
message.delete()
print(f'requeued {len(messages_processed)} messages')
if __name__ == '__main__':
from_queue_name = sys.argv[1]
to_queue_name = sys.argv[2]
redrive_messages(from_queue_name, to_queue_name)

Related

Delete message from SQS DLQ

How to delete a message whose MessageId is known from a SQS DLQ which is having around 10k messsages?
I tried to do it using lambda function but the max messages that can be received using that are only 10.
What are the best ways to do this?
It seems your question is for just one message, but in any case, you can use deleteMessageBatch method. In a node server, you would have this:
const AWS = require("aws-sdk");
const messageIds = ["1", "2", "3"];
const removeNotificationFromQueueData = messageIds.map((id) => {
return {
ReceiptHandle: id,
Id: id,
};
});
const sqs = new AWS.SQS();
sqs
.deleteMessageBatch({
QueueUrl: "http://myQueueUrlEndpoint",
Entries: removeNotificationFromQueueData,
})
.promise();
Another way of using python script
#!/usr/bin/env python
import boto3
def delete_msg_id(queue_url, message_id):
"""Delete message based on message ID
:param queue_url: URL of the SQS queue to read.
:param message_id: message id to be deleted
"""
sqs_client = boto3.client("sqs")
while True:
resp = sqs_client.receive_message(
QueueUrl=queue_url, AttributeNames=["All"], MaxNumberOfMessages=10, WaitTimeSeconds=1
)
for msg in resp["Messages"]:
try:
if msg["MessageId"] == message_id:
sqs_client.delete_mesage(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
except KeyError:
print('failed')
continue
if __name__ == '__main__':
delete_msg_id('my_sqs_url', 'my_msg_id')
Ref to Moving Big Number Of Messages Between SQS Queues

GCP Cloud Tasks: shorten period for creating a previously created named task

We are developing a GCP Cloud Task based queue process that sends a status email whenever a particular Firestore doc write-trigger fires. The reason we use Cloud Tasks is so a delay can be created (using scheduledTime property 2-min in the future) before the email is sent, and to control dedup (by using a task-name formatted as: [firestore-collection-name]-[doc-id]) since the 'write' trigger on the Firestore doc can be fired several times as the document is being created and then quickly updated by backend cloud functions.
Once the task's delay period has been reached, the cloud-task runs, and the email is sent with updated Firestore document info included. After which the task is deleted from the queue and all is good.
Except:
If the user updates the Firestore doc (say 20 or 30 min later) we want to resend the status email but are unable to create the task using the same task-name. We get the following error:
409 The task cannot be created because a task with this name existed too recently. For more information about task de-duplication see https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create#body.request_body.FIELDS.task.
This was unexpected as the queue is empty at this point as the last task completed succesfully. The documentation referenced in the error message says:
If the task's queue was created using Cloud Tasks, then another task
with the same name can't be created for ~1hour after the original task
was deleted or executed.
Question: is there some way in which this restriction can be by-passed by lowering the amount of time, or even removing the restriction all together?
The short answer is No. As you've already pointed, the docs are very clear regarding this behavior and you should wait 1 hour to create a task with same name as one that was previously created. The API or Client Libraries does not allow to decrease this time.
Having said that, I would suggest that instead of using the same Task ID, use different ones for the task and add an identifier in the body of the request. For example, using Python:
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2
import datetime
def create_task(project, queue, location, payload=None, in_seconds=None):
client = tasks_v2.CloudTasksClient()
parent = client.queue_path(project, location, queue)
task = {
'app_engine_http_request': {
'http_method': 'POST',
'relative_uri': '/task/'+queue
}
}
if payload is not None:
converted_payload = payload.encode()
task['app_engine_http_request']['body'] = converted_payload
if in_seconds is not None:
d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
timestamp = timestamp_pb2.Timestamp()
timestamp.FromDatetime(d)
task['schedule_time'] = timestamp
response = client.create_task(parent, task)
print('Created task {}'.format(response.name))
print(response)
#You can change DOCUMENT_ID with USER_ID or something to identify the task
create_task(PROJECT_ID, QUEUE, REGION, DOCUMENT_ID)
Facing a similar problem of requiring to debounce multiple instances of Firestore write-trigger functions, we worked around the default Cloud Tasks task-name based dedup mechanism (still a constraint in Nov 2022) by building a small debounce "helper" using Firestore transactions.
We're using a helper collection _syncHelper_ to implement a delayed throttle for side effects of write-trigger fires - in the OP's case, send 1 email for all writes within 2 minutes.
In our case we are using Firebease Functions task queue utils and not directly interacting with Cloud Tasks but thats immaterial to the solution. The key is to determine the task's execution time in advance and use that as the "dedup key":
async function enqueueTask(shopId) {
const queueName = 'doSomething';
const now = new Date();
const next = new Date(now.getTime() + 2 * 60 * 1000);
try {
const shouldEnqueue = await getFirestore().runTransaction(async t=>{
const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
const doc = await t.get(syncRef);
let data = doc.data();
if (data?.timestamp.toDate()> now) {
return false;
}
await t.set(syncRef, { timestamp: Timestamp.fromDate(next) });
return true;
});
if (shouldEnqueue) {
let queue = getFunctions().taskQueue(queueName);
await queue.enqueue({
timestamp: next.toISOString(),
},
{ scheduleTime: next }); }
} catch {
...
}
}
This will ensure a new task is enqueued only if the "next execution" time has passed.
The execution operation (also a cloud function in our case) will remove the sync data entry if it hasn't been changed since it was executed:
exports.doSomething = functions.tasks.taskQueue({
retryConfig: {
maxAttempts: 2,
minBackoffSeconds: 60,
},
rateLimits: {
maxConcurrentDispatches: 2,
}
}).onDispatch(async data => {
let { timestamp } = data;
await sendYourEmailHere();
await getFirestore().runTransaction(async t => {
const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
const doc = await t.get(syncRef);
let data = doc.data();
if (data?.timestamp.toDate() <= new Date(timestamp)) {
await t.delete(syncRef);
}
});
});
This isn't a bullet proof solution (if the doSomething() execution function has high latency for example) but good enough for 99% of our use cases.

Re-process DLQ events in Lambda

I have an AWS Lambda Function 'A' with a SQS DeadLetterQueue configured. When the Lambda fails to process an event, this is correctly sent to the DLQ. Is there a way to re-process events that ended into a DLQ?
I found two solution, but they both have drawbacks:
Create a new Lambda Function 'B' that reads from the SQS and then sends the events one by one to the previous Lambda 'A'. -> Here I have to write new code and deploy a new Function
Trigger again Lambda 'A' just when an event arrives in the SQS -> This looks dangerous as I can incur in looping executions
My ideal solution should be re-processing on demand the discarded events with Lambda 'A', without creating a new Lambda 'B' from scratch. Is there a way to accomplish this?
Finally, I didn't find any solution from AWS to reprocess the DLQ events of a Lambda Function. Then I created my own custom Lambda Function (I hope that this will be helpful to other developers with same issue):
import boto3
lamb = boto3.client('lambda')
sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='my_dlq_name')
def lambda_handler(event, context):
for _ in range(100):
messages_to_delete = []
for message in queue.receive_messages(MaxNumberOfMessages=10):
payload_bytes_array = bytes(message.body, encoding='utf8')
# print(payload_bytes_array)
lamb.invoke(
FunctionName='my_lambda_name',
InvocationType="Event", # Event = Invoke the function asynchronously.
Payload=payload_bytes_array
)
# Add message to delete
messages_to_delete.append({
'Id': message.message_id,
'ReceiptHandle': message.receipt_handle
})
# If you don't receive any notifications the messages_to_delete list will be empty
if len(messages_to_delete) == 0:
break
# Delete messages to remove them from SQS queue handle any errors
else:
deleted = queue.delete_messages(Entries=messages_to_delete)
print(deleted)
Part of the code is inspired by this post

SQS deleting automatically messages after receiving them by Lambda

I have an SQS that triggers a Lambda function. The Lambda function is just receiving the messsage and putting it in a DynamoDB.
It works fine, but the problem is that i noted that the message is deleted from the SQS without the need to add delete() statement in my code.
But in the code it's clearly mentionned that the message should be manually deleted by the consumer otherwise it wil be putted again in the SQS.
What's going on here ?
I want to deal with situation where there will be a problem with the process and in that case the message should reappear again in the SQS so another Lambda can try to process it.
Here is my Lambda code :
import json
import time
import boto3
def lambda_handler(event, context):
message_id = event['Records'][0]['messageId']
message_receipt_handle = event['Records'][0]['receiptHandle']
message_body = event['Records'][0]['body']
print('Message received :')
print(message_body)
print('Processing message ...')
dynamo_db = boto3.client('dynamodb')
response_db = dynamo_db.put_item(
TableName='sqs-test-sbx',
Item={
'id': {
'S': message_id,
},
'Message': {
'S': message_body,
}
}
)
print('dynamodb response :')
print(response_db)
# Simulate a proceesing ...
time.sleep(10)
print('Message processed')
return {
'statusCode': 200,
'message_id': message_id,
'message_body': message_body,
'event': json.dumps(event)
}
That is normal behavior, when you trigger the lambda directly from SQS
https://docs.aws.amazon.com/en_gb/lambda/latest/dg/with-sqs.html
When your function successfully processes a batch, Lambda deletes its
messages from the queue.
You need to delete the message, when you fetch the messages by your own from SQS for instancde from a EC2 instance.

boto3: can't find queue that was immediately created before

I create an SQS queue in boto3 and immediately look for it via sqs.list_queues but it won't return anything.
when I input the SQS queue name into the console, it won't return anything until I input it again the second time.
So does this mean I need to call list_queues twice? Why is this happening? Why isn't AWS return queues that was immediately created before?
sqs = boto3.client('sqs')
myQ = sqs.create_queue(QueueName='just_created')
response = sqs.list_queues(
QueueNamePrefix='just_created'
)
response does not contain the usual array of QueueUrls
Just like many AWS services, SQS control plane is eventually consistent, meaning that it takes a while to propagate the data accross the systems.
If you need the URL of the queue you just created, you can find it in the return value of the create_queue call.
The following operation creates an SQS queue named MyQueue.
response = client.create_queue(
QueueName='MyQueue',
)
print(response)
Expected Output:
{
'QueueUrl': 'https://queue.amazonaws.com/012345678910/MyQueue',
'ResponseMetadata': {
'...': '...',
},
}