I have set up an SQS queue, a dead-letter queue (DLQ) and a Lambda. I have some questions regarding the body format that the SQS queue receives.
After some tries, I can see that some messages end up in the DLQ.
I have read a lot of documentation, but I still can't figure out some basic things.
My task is fairly simple: I will use Axios to send a payload to an external API. That API might give me an error, or be unavailable at the time. This code is very simple, but in the end it will include the Axios implementation:
exports.handler = async (event, context) => {
  event.Records.forEach(record => {
    const { body } = record;
    const { messageAttributes } = record;
    console.log(body);
    console.log(messageAttributes);
  });

  let somethingWentWrongWithApi = new Error();
  throw somethingWentWrongWithApi;
};
After this runs, after some time I see the message in the DLQ, and the original queue is empty, as I expected.
In the real code I will have a catch block, and inside it I will throw the error, exactly as in the example. Lambda will try to execute it three times (I am not sure where I got this information). Then it will return it to SQS, which in turn will push it to the DLQ.
I wonder: should I implement retries in the Lambda and throw the error only after three attempts, or just throw it and rely on the existing process (the three retries)?
I am still reading about the different settings on both SQS and Lambda.
To increase the number of processing attempts, go to:
the SQS service;
right-click on your queue;
choose the Configure Queue option;
under the Dead Letter Queue Settings section you will see a Maximum Receives property (which you can set between 1 and 1000).
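The same setting can also be applied programmatically by putting a redrive policy on the source queue. Here's a minimal sketch using the AWS SDK for JavaScript; the queue URL and DLQ ARN are placeholders you would replace with your own:

// Sketch: set a redrive policy (maxReceiveCount) on an existing queue.
// The queue URL and DLQ ARN below are placeholders, not real resources.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ apiVersion: '2012-11-05' });

async function setRedrivePolicy() {
  await sqs.setQueueAttributes({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue',
    Attributes: {
      // After 3 failed receives, the message is moved to the DLQ
      RedrivePolicy: JSON.stringify({
        deadLetterTargetArn: 'arn:aws:sqs:us-east-1:123456789012:my-dlq',
        maxReceiveCount: '3'
      })
    }
  }).promise();
}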
About errors: if something goes wrong on the Lambda side, the Lambda throws an error, and you can check what kind of errors were thrown with CloudWatch:
go to AWS CloudWatch;
on the right panel find Logs;
choose your Log Group.
If you have any additional questions just summon me. I'll try to answer them and I'll extend that answer.
I'm working on a solution where I have an SQS queue with a Lambda trigger. My understanding is that Lambda receives messages in batches to be processed, and once the Lambda function succeeds, the messages in the SQS queue are automatically deleted. However, how do I allow only some of those messages to be deleted?
Let's assume this use case:
the Lambda function receives a batch of 10 messages; only 7 are valid and can be processed, and the other 3 need to be reprocessed at a later point.
My initial thought was that I could update the visibility timeout via boto3.sqs.change_visibility_timeout for each of the 3 messages to have them reprocessed after the timeout; however, since the overall Lambda execution is successful, all 10 messages are deleted from the SQS queue.
Any suggestions?
Yes, by default, the Lambda function deletes all the messages upon success. You would need to handle this in your code, but not by changing the visibility timeout of the messages.
Add a DLQ (dead-letter queue) that will actually handle the failed messages (messages go to the DLQ after a certain number of failed processing attempts, depending on how you set it up).
You have a few options here:
You can handle each item yourself and delete the messages that are processed successfully. For a message that isn't successful, you can throw an error and it won't be deleted automatically by the Lambda function (see the sketch after this list)
If you use JavaScript you can try with Middy
If you use Python, you can use Lambda Powertools Python
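For the first option, here's a rough Node.js sketch; the queue URL is a placeholder and processMessage stands in for your own business logic:

// Sketch of option 1: delete each successfully processed message yourself and
// fail the invocation at the end if any message could not be processed.
// QUEUE_URL and processMessage are placeholders for your own queue and logic.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ apiVersion: '2012-11-05' });
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue';

exports.handler = async (event) => {
  let failures = 0;

  for (const record of event.Records) {
    try {
      await processMessage(JSON.parse(record.body));
      // Delete only the messages that were handled successfully
      await sqs.deleteMessage({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: record.receiptHandle
      }).promise();
    } catch (err) {
      console.error('Failed to process message', record.messageId, err);
      failures++;
    }
  }

  if (failures > 0) {
    // Throwing returns the remaining (undeleted) messages to the queue for retry
    throw new Error(`${failures} message(s) failed and will be retried`);
  }
};

// Placeholder for your own processing logic; throw on failure
async function processMessage(payload) {
  console.log('processing', payload);
}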
For AWS Lambdas with an SQS trigger, by default, when your function encounters an error processing one or more messages in a given batch, the entire batch is marked as a failure. All of the messages in the batch are made visible again in the queue. Depending on your redrive policy, you can end up repeatedly processing successful messages along with the failures.
Rather than change the visibility timeout, the simplest way to specify which messages should be retried later and which can safely be deleted from the queue is to change the function response type to ReportBatchItemFailures. This allows you to return a list of failed message ids, indicating that only those messages in the batch should be made visible again in the queue.
Here's what the reporting syntax looks like for a handler function in Node.js:
exports.handler = async (event) => {
  // Process the event
  const batchItemFailureResponse = {
    batchItemFailures: [
      {
        itemIdentifier: "idFailedMessage1"
      },
      {
        itemIdentifier: "idFailedMessage2"
      },
      {
        itemIdentifier: "idFailedMessage3"
      }
    ]
  };
  return batchItemFailureResponse;
};
There is more information to be found in the official documentation.
This response type is configured when setting a queue as an event source for the Lambda. If you're configuring from the console, navigate to the Lambda function page, select the Configuration tab, and then choose Triggers. Then choose Add Trigger and choose the SQS trigger type. In addition to providing the standard parameters, be sure to check the box under Report batch item failures after expanding Additional Settings. It should look something like this:
[Screenshot: the Add trigger dialog with the Report batch item failures box checked]
This parameter must be set when first creating the trigger.
This response type can also be defined if you use CloudFormation templates to provision your resources. See the AWS documentation for more information. Note that if you use AWS SAM event source mappings, the documentation suggests that adding FunctionResponseTypes to the YAML with ReportBatchItemFailures in the type list isn't supported. That is incorrect, the documentation is simply outdated. There is an open issue around addressing this oversight.
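For reference, in a SAM template the event source mapping would look roughly like this (function, queue, and path names are made up for illustration):

# Hypothetical SAM snippet; resource names are placeholders
MyQueueConsumer:
  Type: AWS::Serverless::Function
  Properties:
    Handler: index.handler
    Runtime: nodejs18.x
    CodeUri: ./src
    Events:
      MySQSEvent:
        Type: SQS
        Properties:
          Queue: !GetAtt MyQueue.Arn
          BatchSize: 10
          FunctionResponseTypes:
            - ReportBatchItemFailures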
Finally, in addition to reporting batch item failures, you should provision a target DLQ (dead-letter queue) and determine a reasonable maximum receive count so that action can be taken on messages that fail repeatedly.
I'm looking at the AWS SQS documentation here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/ReceiveMessage.html#receive-sqs-message
My understanding is that we need to delete the message using AmazonSQSClient.DeleteMessage() once we're done processing it, but is this necessary when we're working with an SQS triggered Lambda?
I'm testing with a Lambda function that's triggered by an SQSEvent, and unless I'm mistaken, it appears that if the Lambda function runs to completion without throwing any errors, the message does NOT return to the SQS queue. If this is true, then I would rather avoid making that unnecessary call to AmazonSQSClient.DeleteMessage().
Here is a similar question from 2019 with the top answer saying that the SDK does not delete messages automatically and that they need to be explicitly deleted within the code. I'm wondering if anything has changed since then.
Thoughts?
The key here is that you are using the AWS Lambda integration with SQS. In that instance AWS Lambda handles retrieving the messages from the queue (making them available via the event object), and automatically deletes the message from the queue for you if the Lambda function returns a success status. It will not delete the message from the queue if the Lambda function throws an error.
When using AWS Lambda integration with SQS you should not be using the AWS SDK to interact with the SQS queue at all.
Update:
Lambda now supports partial batch failure for SQS whereby the Lambda function can return a list of failed messages and only those will become visible again.
Amazon SQS doesn't automatically delete a message after retrieving it for you, in case you don't successfully receive the message (for example, if the consumers fail or you lose connectivity). To delete a message, you must send a separate request which acknowledges that you've successfully received and processed the message.
This has not changed and likely won't change in the future, as there is no way for SQS to definitively know in all cases whether messages have been processed successfully. If SQS started to "assume" what happens downstream, it would risk becoming unreliable in many scenarios.
Yes, otherwise the next time you ask for a set of messages, you will get the same messages back - maybe not on the next call, but eventually you will. You likely don't want to keep processing the same set of messages over and over.
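To make that concrete for the non-Lambda case, here's a rough sketch of a hand-rolled consumer using the AWS SDK for JavaScript; the queue URL and handleMessage are placeholders. If you skip the deleteMessage call, the message simply becomes visible again once its visibility timeout expires:

// Sketch of a plain SQS consumer (no Lambda trigger): you must delete messages yourself.
// QUEUE_URL and handleMessage are placeholders for your own queue and logic.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ apiVersion: '2012-11-05' });
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue';

async function poll() {
  const { Messages = [] } = await sqs.receiveMessage({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20 // long polling
  }).promise();

  for (const message of Messages) {
    await handleMessage(JSON.parse(message.Body));
    // Acknowledge successful processing; otherwise the message reappears
    // after the visibility timeout expires.
    await sqs.deleteMessage({
      QueueUrl: QUEUE_URL,
      ReceiptHandle: message.ReceiptHandle
    }).promise();
  }
}

// Placeholder for your own processing logic
async function handleMessage(payload) {
  console.log('processing', payload);
}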
I'm implementing my own webhooks service which will send out events to subscribed webhooks.
Overview of architecture:
events are pushed onto an SQS queue
a lambda function is triggered by SQS messages (event source mapping)
for each event, I make outgoing http requests to subscribed webhooks
non-2xx responses must be retried with exponential backoff (in such an event, I change the message visibility on the received message)
since Lambdas that are invoked by SQS automatically delete the message upon completion, I throw an error at the end of the function to prevent the automatic delete
As far as I can tell, the call to change the message visibility is succeeding. I'm wondering if there's something else baked into Lambdas that are invoked by SQS. Upon failure, is Lambda internally changing the message visibility again? Or do Lambdas that are invoked by SQS not respect message visibility changes (this really doesn't make any sense to me)? Curious if anyone has any insight into this problem. I was quite surprised to find out that Lambda automatically deletes messages upon success, since it makes my particular use case feel a little clunky - throwing an error to fail the Lambda function just to prevent the message from being deleted.
Thanks in advance!
The nature of the SQS integration with Lambda is that the integration controls the polling of the messages. The mechanism used to determine if the message should be deleted is that the response from the lambda is not an error. It's not clearly stated in the documentation, but I believe when an error occurs the integration is going to set the visibility timeout to zero in order to make it immediately available for another process to pick it up. So in your example you are setting it to some number that allows you to retry, but when you return an error the integration is setting the timeout back to zero. If you need to have more control over that process you probably should not use the integration.
Update: This is actually not the case. I was failing to properly wait for the call to adjust the timeout to finish. As such the lambda was closing before that request was completing. The timeout I'm setting on the message within the lambda is being respected. I'm then throwing an error to prevent the message from being deleted.
Since SQS triggers for lambdas process messages in batches, using await in a standard forEach doesn't work (forEach does not await the callbacks if they are async). To bypass that, you can create your own version of an async forEach:
var AWS = require('aws-sdk');

exports.handler = async function(event, context) {
  var sqs = new AWS.SQS({apiVersion: '2012-11-05'});

  await asyncForEach(event.Records, async record => {
    const { receiptHandle } = record;
    const sqsParams = {
      QueueUrl: '<YOUR_QUEUE_URL>',      /* required */
      ReceiptHandle: receiptHandle,      /* required */
      VisibilityTimeout: '<in seconds>'  /* required */
    };
    try {
      let res = await sqs.changeMessageVisibility(sqsParams).promise();
      console.log(res);
    } catch (err) {
      console.log(err, err.stack);
      // Rethrow so the invocation fails and the message is not deleted
      throw new Error('Fail.');
    }
  });

  return {};
};

async function asyncForEach(array, callback) {
  for (let index = 0; index < array.length; index++) {
    await callback(array[index], index, array);
  }
}
I'm trying to design a small message processing system based on SQS, Lambda, and SNS. In case of failure, I'd like for the message to be enqueued in a Dead Letter Queue (DLQ) and for a webhook to be called.
I'd like to know what the most canonical or reasonable way of achieving that would look like.
Currently, if everything goes well, the process should be as follows:
SQS (in place to handle retries) enqueues a message
Lambda gets invoked by SQS and processes the message
Lambda sends a webhook and finishes normally
If something in the lambda goes wrong (success webhook cannot be called, task at hand cannot be processed), the easiest way to achieve what I want seems to be to set up a DLQ1 that SQS would put the failed messages in. An auxiliary lambda would then be called to process this message, pass it to SNS, which would call the failure webhook, and also forward the message to DLQ2, the final/true DLQ.
Is that the best approach?
One alternative I know of is Alarms, though I've been warned that they are quite tricky. Another one would be to have the Lambda call the error-reporting webhook if there's a failure on the last retry, although that somehow seems inappropriate.
Thanks!
Your architecture looks good enough in case of success, but I personally find it quite confusing if anything goes wrong as I don't see why you need two DLQs to begin with.
Here's what I would do in case of failure:
Define a DLQ on your source SQS Queue and set the maxReceiveCount to e.g. 3, meaning if messages fail three times, they will be redirected to the configured DLQ
Create a Lambda that listens to this DLQ.
Execute the webhook inside this Lambda.
Since step 3 automatically deletes the message from the Queue once it has been processed and, apparently, you want the messages to be persisted somewhere, store the content of the message in a file on S3 and store the file metadata (bucket and key) in a table in DynamoDB, so you can always query for failed messages.
I don't see any role for SNS here unless you want multiple subscribers for a given message, but as I see this is not the case.
This way, you need to maintain only one DLQ and you can get rid of SNS, as it's only adding an extra layer of complexity to your architecture.
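Putting the steps above together, a rough sketch of that DLQ-handler Lambda in Node.js could look like this; the webhook URL, bucket, and table names are made up, and axios is assumed for the HTTP call:

// Sketch of a Lambda listening to the DLQ: call the failure webhook, persist the
// message body to S3, and record the S3 location in DynamoDB for later querying.
// WEBHOOK_URL, BUCKET and TABLE are placeholder names.
const AWS = require('aws-sdk');
const axios = require('axios');

const s3 = new AWS.S3();
const dynamo = new AWS.DynamoDB.DocumentClient();

const WEBHOOK_URL = 'https://example.com/failure-webhook';
const BUCKET = 'my-failed-messages-bucket';
const TABLE = 'FailedMessages';

exports.handler = async (event) => {
  for (const record of event.Records) {
    // Notify the failure webhook
    await axios.post(WEBHOOK_URL, { messageId: record.messageId, body: record.body });

    // Persist the raw message body to S3
    const key = `failed/${record.messageId}.json`;
    await s3.putObject({ Bucket: BUCKET, Key: key, Body: record.body }).promise();

    // Store the metadata in DynamoDB so failed messages can be queried later
    await dynamo.put({
      TableName: TABLE,
      Item: { messageId: record.messageId, bucket: BUCKET, key, failedAt: Date.now() }
    }).promise();
  }
};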
I want to execute an HTTP request inside of a Lambda function invoked by API Gateway. The problem is that the request takes a bit of time to complete (<20 seconds) and I don't want the client waiting for a response. In my research on asynchronous requests, I learned that I can pass the X-Amz-Invocation-Type: Event header to make the request execute asynchronously; however, this isn't working and the code still "waits" for the HTTP request to complete.
Below is my lambda code:
'use strict';
const https = require('https');

exports.handler = function (event, context, callback) {
  let requestUrl;
  requestUrl = event.queryStringParameters.url;

  https.get(requestUrl, (res) => {
    console.log('statusCode:', res.statusCode);
    console.log('headers:', res.headers);

    res.on('data', (d) => {
      process.stdout.write(d);
    });
  }).on('error', (e) => {
    console.error(e);
  });

  let response = {
    "statusCode": 200,
    "body": JSON.stringify(event.queryStringParameters)
  };
  callback(null, response);
};
Any help would be appreciated.
You can use two Lambda functions.
Lambda 1 is triggered by API Gateway, calls Lambda 2 asynchronously (InvocationType = Event), and then returns a response to the user.
Lambda 2, once invoked, will trigger the HTTP request.
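A minimal sketch of what Lambda 1 might look like with the AWS SDK for JavaScript (the function name 'lambda-2' is made up):

// Sketch of Lambda 1: fire off Lambda 2 asynchronously, then respond immediately.
// 'lambda-2' is a made-up function name.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

exports.handler = async (event) => {
  // InvocationType 'Event' returns as soon as the invocation is queued
  await lambda.invoke({
    FunctionName: 'lambda-2',
    InvocationType: 'Event',
    Payload: JSON.stringify({ url: event.queryStringParameters.url })
  }).promise();

  return {
    statusCode: 202,
    body: JSON.stringify({ accepted: true })
  };
};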
Whatever you do, don't use two lambda functions.
You can't control whether a Lambda is called async or sync; the caller of the Lambda decides that. For API Gateway, it has decided to call Lambda synchronously.
The possible solutions are one of:
SQS
Step Functions (SF)
SNS
In your API, you call out to one of these services, get back a success, and then immediately return a 202 to your caller.
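For the SQS variant, the API-facing Lambda might look roughly like this sketch (the queue URL is a placeholder); a separate Lambda subscribed to that queue then performs the long-running HTTP request:

// Sketch: the API-facing Lambda just enqueues the work and returns 202 immediately.
// QUEUE_URL is a placeholder; a separate Lambda wired to this queue does the real work.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ apiVersion: '2012-11-05' });
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/work-queue';

exports.handler = async (event) => {
  await sqs.sendMessage({
    QueueUrl: QUEUE_URL,
    MessageBody: JSON.stringify({ url: event.queryStringParameters.url })
  }).promise();

  // Tell the caller the request has been accepted for processing
  return {
    statusCode: 202,
    body: JSON.stringify({ status: 'queued' })
  };
};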
If you have a high volume of single- or double-action executions, use SQS. If you have potentially long-running work with complex state logic, use SF. If you for some reason want to ignore my suggestions, use SNS.
Each of these can (and should) call back out to a Lambda. In the case that you need to run for more than 15 minutes, they can call back out to CodeBuild. Ignore the name of the service; it's just a Lambda that supports runs of up to 8 hours.
Now, why not use two Lambdas (L1, L2)? The answer is simple. Once you respond to your users (with a 202) that their async call was queued, they'll expect it to work 100% of the time. But what happens if that L2 Lambda fails? It won't retry, it won't continue, and you may never know about it.
That L2 Lambda's handler no longer exists, so you don't know the state any more. Further, you could try to add logging to L2 with a wrapper try/catch, but so many other types of failures could happen. Even if you have that, what if CloudWatch is down - will you get the log? Possibly not; it just isn't a reliable strategy. Sure, if you are doing something you don't care about, you can build this architecture, but this isn't how real production solutions are built. You want a reliable process. You want to trust that the baton was successfully passed to another service which takes care of completing the user's transaction. This is why you want to use one of the three services: SQS, SF, or SNS.