Re-process DLQ events in Lambda - amazon-web-services

I have an AWS Lambda Function 'A' with a SQS DeadLetterQueue configured. When the Lambda fails to process an event, this is correctly sent to the DLQ. Is there a way to re-process events that ended into a DLQ?
I found two solution, but they both have drawbacks:
Create a new Lambda Function 'B' that reads from the SQS and then sends the events one by one to the previous Lambda 'A'. -> Here I have to write new code and deploy a new Function
Trigger again Lambda 'A' just when an event arrives in the SQS -> This looks dangerous as I can incur in looping executions
My ideal solution should be re-processing on demand the discarded events with Lambda 'A', without creating a new Lambda 'B' from scratch. Is there a way to accomplish this?

Finally, I didn't find any solution from AWS to reprocess the DLQ events of a Lambda Function. Then I created my own custom Lambda Function (I hope that this will be helpful to other developers with same issue):
import boto3
lamb = boto3.client('lambda')
sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='my_dlq_name')
def lambda_handler(event, context):
for _ in range(100):
messages_to_delete = []
for message in queue.receive_messages(MaxNumberOfMessages=10):
payload_bytes_array = bytes(message.body, encoding='utf8')
# print(payload_bytes_array)
lamb.invoke(
FunctionName='my_lambda_name',
InvocationType="Event", # Event = Invoke the function asynchronously.
Payload=payload_bytes_array
)
# Add message to delete
messages_to_delete.append({
'Id': message.message_id,
'ReceiptHandle': message.receipt_handle
})
# If you don't receive any notifications the messages_to_delete list will be empty
if len(messages_to_delete) == 0:
break
# Delete messages to remove them from SQS queue handle any errors
else:
deleted = queue.delete_messages(Entries=messages_to_delete)
print(deleted)
Part of the code is inspired by this post

Related

Kinesis Stream not getting the logs

I am receiving cloudtrail logs in Kinesis data stream. I am invoking a stream processing lambda function as described here. The final result that gets returned to the stream is then stored onto an S3 bucket. As of now, the processing fails with the following error file created in the S3 bucket:
{"attemptsMade":4,"arrivalTimestamp":1619677225356,"errorCode":"Lambda.FunctionError","errorMessage":"Check your function and make sure the output is in required format. In addition to that, make sure the processed records contain valid result status of Dropped, Ok, or ProcessingFailed","attemptEndingTimestamp":1619677302684,
Adding in the Python lambda function here for reference:
import base64
import gzip
import json
import logging
# Setup logging configuration
logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
def unpack_kinesis_stream_records(event):
# decode and decompress each base64 encoded data element
return [gzip.decompress(base64.b64decode(k["data"])).decode('utf-8') for k in event["records"]]
def decode_raw_cloud_trail_events(cloudTrailEventDataList):
#Convert Raw Event Data List
eventList = [json.loads(e) for e in cloudTrailEventDataList]
#Filter out-non DATA_MESSAGES
filteredEvents = [e for e in eventList if e["messageType"] == 'DATA_MESSAGE']
#Covert each indidual log Event Message
events = []
for f in filteredEvents:
for e in f["logEvents"]:
events.append(json.loads(e["message"]))
logger.info("{0} Event Logs Decoded".format(len(events)))
return events
def handle_request(event, context):
#Log Raw Kinesis Stream Records
#logger.debug(json.dumps(event, indent=4))
# Unpack Kinesis Stream Records
kinesisData = unpack_kinesis_stream_records(event)
#[logger.debug(k) for k in kinesisData]
# Decode and filter events
events = decode_raw_cloud_trail_events(kinesisData)
####### INTEGRATION CODE GOES HERE #########
return f"Successfully processed {len(events)} records."
def lambda_handler(event, context):
return handle_request(event, context)
Can anyone help me understand the problem here.
I believe you are using 'kinesis firehose' service and not 'kinesis data stream'. code you are using is used to read directly from kinesis data stream and process cloudtrail events.
kinesis firehose data transformation lambda function is different. Firehose sends received cloudtrail events to lambda function. Lambda process/transform the events and should send those events back to firehose, so that firehose can deliver them to destination S3 bucket.
Your lambda function should return records in exactly same format as firehose expects them and each record should have either of the status [Dropped, Ok, or ProcessingFailed]. You can read more in aws doc

SQS deleting automatically messages after receiving them by Lambda

I have an SQS that triggers a Lambda function. The Lambda function is just receiving the messsage and putting it in a DynamoDB.
It works fine, but the problem is that i noted that the message is deleted from the SQS without the need to add delete() statement in my code.
But in the code it's clearly mentionned that the message should be manually deleted by the consumer otherwise it wil be putted again in the SQS.
What's going on here ?
I want to deal with situation where there will be a problem with the process and in that case the message should reappear again in the SQS so another Lambda can try to process it.
Here is my Lambda code :
import json
import time
import boto3
def lambda_handler(event, context):
message_id = event['Records'][0]['messageId']
message_receipt_handle = event['Records'][0]['receiptHandle']
message_body = event['Records'][0]['body']
print('Message received :')
print(message_body)
print('Processing message ...')
dynamo_db = boto3.client('dynamodb')
response_db = dynamo_db.put_item(
TableName='sqs-test-sbx',
Item={
'id': {
'S': message_id,
},
'Message': {
'S': message_body,
}
}
)
print('dynamodb response :')
print(response_db)
# Simulate a proceesing ...
time.sleep(10)
print('Message processed')
return {
'statusCode': 200,
'message_id': message_id,
'message_body': message_body,
'event': json.dumps(event)
}
That is normal behavior, when you trigger the lambda directly from SQS
https://docs.aws.amazon.com/en_gb/lambda/latest/dg/with-sqs.html
When your function successfully processes a batch, Lambda deletes its
messages from the queue.
You need to delete the message, when you fetch the messages by your own from SQS for instancde from a EC2 instance.

boto3: can't find queue that was immediately created before

I create an SQS queue in boto3 and immediately look for it via sqs.list_queues but it won't return anything.
when I input the SQS queue name into the console, it won't return anything until I input it again the second time.
So does this mean I need to call list_queues twice? Why is this happening? Why isn't AWS return queues that was immediately created before?
sqs = boto3.client('sqs')
myQ = sqs.create_queue(QueueName='just_created')
response = sqs.list_queues(
QueueNamePrefix='just_created'
)
response does not contain the usual array of QueueUrls
Just like many AWS services, SQS control plane is eventually consistent, meaning that it takes a while to propagate the data accross the systems.
If you need the URL of the queue you just created, you can find it in the return value of the create_queue call.
The following operation creates an SQS queue named MyQueue.
response = client.create_queue(
QueueName='MyQueue',
)
print(response)
Expected Output:
{
'QueueUrl': 'https://queue.amazonaws.com/012345678910/MyQueue',
'ResponseMetadata': {
'...': '...',
},
}

Extracting EC2InstanceId from SNS/SQS Auto Scaling message

I'm using python Boto3 code, when an instance is terminated from Auto Scaling group it notifies SNS which publishes the message to SQS. Lambda is also triggered when SNS is notified, which executes a boto script to grab the message from SQS.
I am using reference code from Sending and Receiving Messages in Amazon SQS.
Here is the code snippet:
if messages.get('Messages'):
m = messages.get('Messages')[0]
body = m['Body']
print('Received and deleted message: %s' % body)
The result is:
START RequestId: 1234-xxxxxxxx Version: $LATEST
{
"Type" : "Notification",
"MessageId" : "d1234xxxxxx",
"TopicArn" : "arn:aws:sns:us-east-1:xxxxxxxxxx:AutoScale-Topic",
"Subject" : "Auto Scaling: termination for group \"ASG\"",
"Message" : "{\"Progress\":50,\"AccountId\":\"xxxxxxxxx\",\"Description\":\"Terminating EC2 instance: i-123456\",\"RequestId\":\"db-xxxxx\",\"EndTime\":\"2017-07-13T22:17:19.678Z\",\"AutoScalingGroupARN\":\"arn:aws:autoscaling:us-east-1:360695249386:autoScalingGroup:fef71649-b184xxxxxx:autoScalingGroupName/ASG\",\"ActivityId\":\"db123xx\",\"EC2InstanceId\":\"i-123456\",\"StatusCode\"\"}",
"Timestamp" : "2017-07-",
"SignatureVersion" : "1",
"Signature" : "",
"SigningCertURL" : "https://sns.us-east-1.amazonaws.com/..",
"UnsubscribeURL" : "https://sns.us-east-1.amazonaws.com/
}
I only need EC2InstanceId of the terminated instance not the whole message. How can I extract the ID?
If your goal is to execute an AWS Lambda function (having the EC2 Instance ID as a parameter), there is no need to also publish the message to an Amazon SQS queue. In fact, this would be unreliable because you cannot guarantee that the message being retrieved from the SQS queue matches the invocation of your Lambda function.
Fortunately, when Auto Scaling sends an event to SNS and SNS then triggers a Lambda function, SNS passes the necessary information directly to the Lambda function.
Start your Lambda function with this code (or similar):
def lambda_handler(event, context):
# Dump the event to the log, for debugging purposes
print("Received event: " + json.dumps(event, indent=2))
# Extract the EC2 instance ID from the Auto Scaling event notification
message = event['Records'][0]['Sns']['Message']
autoscalingInfo = json.loads(message)
ec2InstanceId = autoscalingInfo['EC2InstanceId']
Your code then has the EC2 Instance ID, without having to use Amazon SQS.
The instance id is in the message. It's raw JSON, so you can parse it with the json package and get the information.
import json
if messages.get('Messages'):
m = messages.get('Messages')[0]
body = m['Body']
notification_message = json.loads(body["Message"])
print('instance id is: %s' % notification_message["EC2InstanceId"])

Best way to move messages off DLQ in Amazon SQS?

What is the best practice to move messages from a dead letter queue back to the original queue in Amazon SQS?
Would it be
Get message from DLQ
Write message to queue
Delete message from DLQ
Or is there a simpler way?
Also, will AWS eventually have a tool in the console to move messages off the DLQ?
Here is a quick hack. This is definitely not the best or recommended option.
Set the main SQS queue as the DLQ for the actual DLQ with Maximum Receives as 1.
View the content in DLQ (This will move the messages to the main queue as this is the DLQ for the actual DLQ)
Remove the setting so that the main queue is no more the DLQ of the actual DLQ
On Dec 1 2021 AWS released the ability to redrive messages from a DLQ back to the source queue(or custom queue).
With dead-letter queue redrive to source queue, you can simplify and enhance your error-handling workflows for standard queues.
Source:
Introducing Amazon Simple Queue Service dead-letter queue redrive to source queues
There are a few scripts out there that do this for you:
npm / nodejs based: http://github.com/garryyao/replay-aws-dlq
# install
npm install replay-aws-dlq;
# use
npx replay-aws-dlq [source_queue_url] [dest_queue_url]
go based: https://github.com/mercury2269/sqsmover
# compile: https://github.com/mercury2269/sqsmover#compiling-from-source
# use
sqsmover -s [source_queue_url] -d [dest_queue_url]
Don't need to move the message because it will come with so many other challenges like duplicate messages, recovery scenarios, lost message, de-duplication check and etc.
Here is the solution which we implemented -
Usually, we use the DLQ for transient errors, not for permanent errors. So took below approach -
Read the message from DLQ like a regular queue
Benefits
To avoid duplicate message processing
Better control on DLQ- Like I put a check, to process only when the regular queue is completely processed.
Scale up the process based on the message on DLQ
Then follow the same code which regular queue is following.
More reliable in case of aborting the job or the process got terminated while processing (e.g. Instance killed or process terminated)
Benefits
Code reusability
Error handling
Recovery and message replay
Extend the message visibility so that no other thread process them.
Benefit
Avoid processing same record by multiple threads.
Delete the message only when either there is a permanent error or successful.
Benefit
Keep processing until we are getting a transient error.
I wrote a small python script to do this, by using boto3 lib:
conf = {
"sqs-access-key": "",
"sqs-secret-key": "",
"reader-sqs-queue": "",
"writer-sqs-queue": "",
"message-group-id": ""
}
import boto3
client = boto3.client(
'sqs',
aws_access_key_id = conf.get('sqs-access-key'),
aws_secret_access_key = conf.get('sqs-secret-key')
)
while True:
messages = client.receive_message(QueueUrl=conf['reader-sqs-queue'], MaxNumberOfMessages=10, WaitTimeSeconds=10)
if 'Messages' in messages:
for m in messages['Messages']:
print(m['Body'])
ret = client.send_message( QueueUrl=conf['writer-sqs-queue'], MessageBody=m['Body'], MessageGroupId=conf['message-group-id'])
print(ret)
client.delete_message(QueueUrl=conf['reader-sqs-queue'], ReceiptHandle=m['ReceiptHandle'])
else:
print('Queue is currently empty or messages are invisible')
break
you can get this script in this link
this script basically can move messages between any arbitrary queues. and it supports fifo queues as well as you can supply the message_group_id field.
That looks like your best option. There is a possibility that your process fails after step 2. In that case you'll end up copying the message twice, but you application should be handling re-delivery of messages (or not care) anyway.
here:
import boto3
import sys
import Queue
import threading
work_queue = Queue.Queue()
sqs = boto3.resource('sqs')
from_q_name = sys.argv[1]
to_q_name = sys.argv[2]
print("From: " + from_q_name + " To: " + to_q_name)
from_q = sqs.get_queue_by_name(QueueName=from_q_name)
to_q = sqs.get_queue_by_name(QueueName=to_q_name)
def process_queue():
while True:
messages = work_queue.get()
bodies = list()
for i in range(0, len(messages)):
bodies.append({'Id': str(i+1), 'MessageBody': messages[i].body})
to_q.send_messages(Entries=bodies)
for message in messages:
print("Coppied " + str(message.body))
message.delete()
for i in range(10):
t = threading.Thread(target=process_queue)
t.daemon = True
t.start()
while True:
messages = list()
for message in from_q.receive_messages(
MaxNumberOfMessages=10,
VisibilityTimeout=123,
WaitTimeSeconds=20):
messages.append(message)
work_queue.put(messages)
work_queue.join()
DLQ comes into play only when the original consumer fails to consume message successfully after various attempts. We do not want to delete the message since we believe we can still do something with it (maybe attempt to process again or log it or collect some stats) and we do not want to keep encountering this message again and again and stop the ability to process other messages behind this one.
DLQ is nothing but just another queue. Which means we would need to write a consumer for DLQ that would ideally run less frequently (compared to original queue) that would consume from DLQ and produce message back into the original queue and delete it from DLQ - if thats the intended behavior and we think original consumer would be now ready to process it again. It should be OK if this cycle continues for a while since we now also get an opportunity to manually inspect and make necessary changes and deploy another version of original consumer without losing the message (within the message retention period of course - which is 4 days by default).
Would be nice if AWS provides this capability out of the box but I don't see it yet - they're leaving this to the end user to use it in way they feel appropriate.
There is a another way to achieve this without writing single line of code.
Consider your actual queue name is SQS_Queue and the DLQ for it is SQS_DLQ.
Now follow these steps:
Set SQS_Queue as the dlq of SQS_DLQ. Since SQS_DLQ is already a dlq of SQS_Queue. Now, both are acting as the dlq of the other.
Set max receive count of your SQS_DLQ to 1.
Now read messages from SQS_DLQ console. Since message receive count is 1, it will send all the message to its own dlq which is your actual SQS_Queue queue.
We use the following script to redrive message from src queue to tgt queue:
filename: redrive.py
usage: python redrive.py -s {source queue name} -t {target queue name}
'''
This script is used to redrive message in (src) queue to (tgt) queue
The solution is to set the Target Queue as the Source Queue's Dead Letter Queue.
Also set Source Queue's redrive policy, Maximum Receives to 1.
Also set Source Queue's VisibilityTimeout to 5 seconds (a small period)
Then read data from the Source Queue.
Source Queue's Redrive Policy will copy the message to the Target Queue.
'''
import argparse
import json
import boto3
sqs = boto3.client('sqs')
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('-s', '--src', required=True,
help='Name of source SQS')
parser.add_argument('-t', '--tgt', required=True,
help='Name of targeted SQS')
args = parser.parse_args()
return args
def verify_queue(queue_name):
queue_url = sqs.get_queue_url(QueueName=queue_name)
return True if queue_url.get('QueueUrl') else False
def get_queue_attribute(queue_url):
queue_attributes = sqs.get_queue_attributes(
QueueUrl=queue_url,
AttributeNames=['All'])['Attributes']
print(queue_attributes)
return queue_attributes
def main():
args = parse_args()
for q in [args.src, args.tgt]:
if not verify_queue(q):
print(f"Cannot find {q} in AWS SQS")
src_queue_url = sqs.get_queue_url(QueueName=args.src)['QueueUrl']
target_queue_url = sqs.get_queue_url(QueueName=args.tgt)['QueueUrl']
target_queue_attributes = get_queue_attribute(target_queue_url)
# Set the Source Queue's Redrive policy
redrive_policy = {
'deadLetterTargetArn': target_queue_attributes['QueueArn'],
'maxReceiveCount': '1'
}
sqs.set_queue_attributes(
QueueUrl=src_queue_url,
Attributes={
'VisibilityTimeout': '5',
'RedrivePolicy': json.dumps(redrive_policy)
}
)
get_queue_attribute(src_queue_url)
# read all messages
num_received = 0
while True:
try:
resp = sqs.receive_message(
QueueUrl=src_queue_url,
MaxNumberOfMessages=10,
AttributeNames=['All'],
WaitTimeSeconds=5)
num_message = len(resp.get('Messages', []))
if not num_message:
break
num_received += num_message
except Exception:
break
print(f"Redrive {num_received} messages")
# Reset the Source Queue's Redrive policy
sqs.set_queue_attributes(
QueueUrl=src_queue_url,
Attributes={
'VisibilityTimeout': '30',
'RedrivePolicy': ''
}
)
get_queue_attribute(src_queue_url)
if __name__ == "__main__":
main()
AWS Lambda solution worked well for us -
Detailed instructions:
https://serverlessrepo.aws.amazon.com/applications/arn:aws:serverlessrepo:us-east-1:303769779339:applications~aws-sqs-dlq-redriver
Github: https://github.com/honglu/aws-sqs-dlq-redriver.
Deployed with a click and another click to start the redrive!
Here is also the script (written in Typescript) to move the messages from one AWS queue to another one. Maybe it will be useful for someone.
import {
SQSClient,
ReceiveMessageCommand,
DeleteMessageBatchCommand,
SendMessageBatchCommand,
} from '#aws-sdk/client-sqs'
const AWS_REGION = 'eu-west-1'
const AWS_ACCOUNT = '12345678901'
const DLQ = `https://sqs.${AWS_REGION}.amazonaws.com/${AWS_ACCOUNT}/dead-letter-queue`
const QUEUE = `https://sqs.${AWS_REGION}.amazonaws.com/${AWS_ACCOUNT}/queue`
const loadMessagesFromDLQ = async () => {
const client = new SQSClient({region: AWS_REGION})
const command = new ReceiveMessageCommand({
QueueUrl: DLQ,
MaxNumberOfMessages: 10,
VisibilityTimeout: 60,
})
const response = await client.send(command)
console.log('---------LOAD MESSAGES----------')
console.log(`Loaded: ${response.Messages?.length}`)
console.log(JSON.stringify(response, null, 4))
return response
}
const sendMessagesToQueue = async (entries: Array<{Id: string, MessageBody: string}>) => {
const client = new SQSClient({region: AWS_REGION})
const command = new SendMessageBatchCommand({
QueueUrl: QUEUE,
Entries: entries.map(entry => ({...entry, DelaySeconds: 10})),
// [
// {
// Id: '',
// MessageBody: '',
// DelaySeconds: 10
// }
// ]
})
const response = await client.send(command)
console.log('---------SEND MESSAGES----------')
console.log(`Send: Successful - ${response.Successful?.length}, Failed: ${response.Failed?.length}`)
console.log(JSON.stringify(response, null, 4))
}
const deleteMessagesFromQueue = async (entries: Array<{Id: string, ReceiptHandle: string}>) => {
const client = new SQSClient({region: AWS_REGION})
const command = new DeleteMessageBatchCommand({
QueueUrl: DLQ,
Entries: entries,
// [
// {
// "Id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
// "ReceiptHandle": "someReceiptHandle"
// }
// ]
})
const response = await client.send(command)
console.log('---------DELETE MESSAGES----------')
console.log(`Delete: Successful - ${response.Successful?.length}, Failed: ${response.Failed?.length}`)
console.log(JSON.stringify(response, null, 4))
}
const run = async () => {
const dlqMessageList = await loadMessagesFromDLQ()
if (!dlqMessageList || !dlqMessageList.Messages) {
console.log('There is no messages in DLQ')
return
}
const sendMsgList: any = dlqMessageList.Messages.map(msg => ({ Id: msg.MessageId, MessageBody: msg.Body}))
const deleteMsgList: any = dlqMessageList.Messages.map(msg => ({ Id: msg.MessageId, ReceiptHandle: msg.ReceiptHandle}))
await sendMessagesToQueue(sendMsgList)
await deleteMessagesFromQueue(deleteMsgList)
}
run()
P.S. The script is with room for improvement, but anyway might be useful.
here is a simple python script you can use from the cli to do the same, depending only on boto3
usage
python redrive_messages __from_queue_name__ __to_queue_name__
code
import sys
import boto3
from src.utils.get_config.get_config import get_config
from src.utils.get_logger import get_logger
sqs = boto3.resource('sqs')
config = get_config()
log = get_logger()
def redrive_messages(from_queue_name:str, to_queue_name:str):
# initialize the queues
from_queue = sqs.get_queue_by_name(QueueName=from_queue_name)
to_queue = sqs.get_queue_by_name(QueueName=to_queue_name)
# begin querying for messages
should_check_for_more = True
messages_processed = []
while (should_check_for_more):
# grab the next message
messages = from_queue.receive_messages(MaxNumberOfMessages=1);
if (len(messages) == 0):
should_check_for_more = False;
break;
message = messages[0]
# requeue it
to_queue.send_message(MessageBody=message.body, DelaySeconds=0)
# let the queue know that the message was processed successfully
messages_processed.append(message)
message.delete()
print(f'requeued {len(messages_processed)} messages')
if __name__ == '__main__':
from_queue_name = sys.argv[1]
to_queue_name = sys.argv[2]
redrive_messages(from_queue_name, to_queue_name)