In case of lambda workers processing batches from an SQS queue, Is there an option to monitor the worker's failure rate (wrt processing job) and block further dequeueing (and as a result, lambda invocations) in case failure rate crosses a threshold? I can monitor lambda's error/invocation rate, but how would the dequeue halting be implemented? I don't want to empty the queue and lose the data.
First thing is to understand why your Lambda could (possibly) be failing:
1) If they are failing because of throttling (more messages to be processed than available Lambda functions), the message (or the whole batch) will be sent back to the Queue and will be tried again once the Visibility Timeout expires, so the retry logic is already built-in for you and scales well.
2) If they are failing because of bad messages or some error in the code, you can configure a DLQ to send the failed messages to. This is easy to setup as you only need to tell your Lambda function which DLQ to connect to in case of failure.
If you scenario is 1), rest assured your messages won't be lost. If your scenario is 2), just configure a DLQ for further analysis of the failed messages.
You can also check the official docs to understand Lambda's Retry Behaviour
Related
I came across an AWS article where it is mentioned only once delivery of a message is not guaranteed when the FIFO queue is used with a lambda trigger.
Amazon SQS FIFO queues ensure that the order of processing follows the message order within a message group. However, it does not guarantee only once delivery when used as a Lambda trigger. If only once delivery is important in your serverless application, it’s recommended to make your function idempotent. You could achieve this by tracking a unique attribute of the message using a scalable, low-latency control database like Amazon DynamoDB.
I am more interested in knowing the reason behind this behaviour when it comes to lambda trigger. I assume, with standard queues only once delivery is not guaranteed since SQS stores messages in multiple servers for redundancy and high availability and there is a chance of same message getting delivered again while multiple lambdas polling the queue.
Can someone please explain the reason for the same behaviour in FIFO queue with lambda trigger or the working internally?
By default lambda polls synchronously from SQS. So when lambda processes messages from the queue they become invisible i.e Visibility timeout gets triggered till the lambda either finishes the process to eventually delete them from the queue or fails to retry them again.
That's why lambda cannot guarantee exactly-once delivery since there can be a retry in lambda cause of timeout (15min max) or other code dependency errors.
To prevent this you either make your process idempotent or use Batch response to delete the message even in case of failure.
I have lambda using SQS events as inputs. The SQS queue also has a DLQ.
The lambda function invokes a downstream Restful API (call this operation DoPostToAPI())
I need to guarantee that the lambda function attempts to call DoPostToAPI() at least 2 times (before message goes to DLQ)
What configuration of Lambda Retries and SQS Redrive policy would I need to set in order to accomplish the above requirement?
I need to be 100% certain that messages that arrive on the DLQ only arrive because they have attempted to been sent to downstream API DoPostToAPI() 2 times, and that messages dont arrive in DLQ for any other reason, if possible.
To me, it makes sense that messages should only arrive on the DLQ if the operation was attempted, and not for other reasons (i.e. I dont want messages to arrive on DLQ purely because of throttling, since the DoPostToAPI() should be attempted first before sending to DLQ) Why would I want messages on DLQ if the lambda function operation wasnt even attempted? In order words, I need the lambda operation to be guaranteed to be invoked before item moves to DLQ.
Can I get some help on this? Is it possible to guarantee that messages on the DLQ have arrived because of failed DoPostToAPI() api calls? Or is it (more unfortunate) possible that messages arrive on DLQ for reasons other than failed calls to downstream API?
From what I have read online so far, its possible that lambda , after doing receive on SQS message and moving the message to invisibile on the queue, could run into throttling issues and re-attempt the lambda invocation. But if it runs into lambda throttling again, it could end up back on main queue, which if it reaches its max receive count, could place the message on the DLQ without the lambda having been attempted at all. Is this correct?
For simplicity lets imagine the following inputs
SQSQueue1
SQSQueue1DLQ
LambdaFunction1 --> ServiceClient1.DoPostToAPI()
What is the interplay between the lambda "maximum_retry_attempts" and the SQS redrive_policy "maxReceiveCount"
In order to ensure your lambda attempts retries when using SQS, You only need set the SQS property
maxReceiveCount
This value controls how many lambda invocations will be attempted for a given batch before a message goes to the Dead Letter queue.
Unfortunately, the lambda property
maximum_retry_attempts
Does not apply for lambda functions using SQS as function event trigger.
My Lambda configuration is as below
Lambda Concurrency is set to 50
And SQS trigger batch size is set to 1
Issue:
When my queue is flooded with 200+ messages, some of the sqs triggers are missed and the message from the queue goes to inflight state without even triggering the lambda. This is adding a latency in processing by the timeout value set for lambda as I need to wait for the message to come out of flight for it to be reprocessed.
Any inputs will be highly appreciated.
SQS is integrated with Lambda through event source mappings.
Thanks to the mappings, the Lambda service is long polling the SQS queue, and invoking your function on your behalf. What's more it automatically removes the messages from the queue if your Lambda successfully processes them.
Since you want to process 200+ messages, and you set concurrency to 50 with batch size of 1, it means that you can process only 50 messages in parallel. The rest will be throttled. When this happens:
If your function is throttled, returns an error, or doesn't respond, the message becomes visible again. All messages in a failed batch return to the queue, so your function code must be able to process the same message multiple times without side effects.
To rectify the issue, the following two immediate actions can be considered:
increase concurrency of your function to 200 or more.
increase batch size to 10. With the batch size and concurrency of 50, you can process 500 (10 x 50) messages concurrently.
Also since you are heavily throttled, setting up a dead-letter queue can be useful. The DLQ helps captures problematic or missed messages from the queue, so that you can process them later or inspect:
If a message fails to be processed multiple times, Amazon SQS can send it to a dead-letter queue. When your function returns an error, Lambda leaves it in the queue. After the visibility timeout occurs, Lambda receives the message again. To send messages to a second queue after a number of receives, configure a dead-letter queue on your source queue.
I have deployed a AWS Lambda function that triggers when a SQS queue receives a message. The function makes a request to a Rest API and if the response is not Ok the SQS message needs to be processed again.
That's why I need to resend the message to the queue but I would prefer to delete the SQS messages programatically, although I can't find how to configure SQS. I have tried message retention but it seems the trigger event causes the message being deleted anyway.
Other possible options could be back up the message in S3 or persisting it in DynamoDB but I wonder if there's a better option.
Any insights on this question would be very helpful.
From AWS Lambda Retry Behavior - AWS Lambda:
If you configure an Amazon SQS queue as an event source, AWS Lambda will poll a batch of records in the queue and invoke your Lambda function. If the invocation fails or times out, every message in the batch will be returned to the queue, and each will be available for processing once the Visibility Timeout period expires. (Visibility timeouts are a period of time during which Amazon Simple Queue Service prevents other consumers from receiving and processing the message).
Once an invocation successfully processes a batch, each message in that batch will be removed from the queue. When a message is not successfully processed, it is either discarded or if you have configured an Amazon SQS Dead Letter Queue, the failure information will be directed there for you to analyze.
So, it seems (from reading this) that a simple option would be set a high visibility timeout on the queue and then raise an error if the function cannot process the message. This message will remain invisible for the configured timeout period, then would reappear on the queue for processing. If it exceeds the permitted number of retries, it would be deleted or moved to a Dead Letter Queue (if configured).
There is a lambda-powertools library created and maintained by AWSLabs and one of the feature is batch processing.
The batch processing utility handles partial failures when processing
batches from Amazon SQS, Amazon Kinesis Data Streams, and Amazon
DynamoDB Streams.
Check out the documentation here. This is the python version, but there are versions for other environments.
So after some research I found the following:
Frankly there was an workaround options to selectively filter out messages processed as good ones from a batch - before aws implemented it.
Kindly refer to approaches 1-3 demonstrated in here
As for using aws's implementation use approach No.4
Simple question:
I want to run an autoscale group on Amazon, which fires up multiple instance which processes the messages from a SQS queue. But how do I know that the instances aren't processing the same messages?
I can delete a message from the queue when it's processed. But if it's not deleted yet and still being processed by an instance, another instance CAN download that same message and processing it also, to my opinion.
Aside from the fairly remote possibility of SQS incorrectly delivering the same message more than once (which you still need to account for, even though it is unlikely), I suspect your question stems from a lack of familiarity with SQS's concept of "visibility timeout."
Immediately after the component receives the message, the message is still in the queue. However, you don't want other components in the system receiving and processing the message again. Therefore, Amazon SQS blocks them with a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing that message.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/AboutVT.html
This is what keeps multiple queue runners from seeing the same message. Once the visibility timeout expires, the message will be delivered again to a queue consumer, unless you delete it, or it exceeds the maximum configured number of deliveries (at which point it's deleted or goes into a separate dead letter queue if you have configured one). If a job will take longer than the configured visibility timeout, your consumer can also send a request to SQS to change the visibility timeout for that individual message.
Update:
Since this answer was originally written, SQS has introduced FIFO Queues in some of the AWS regions. These operate with the same logic described above, but with guaranteed in-order delivery and additional safeguards to guarantee that occasional duplicate message delivery cannot occur.
FIFO (First-In-First-Out) queues are designed to enhance messaging between applications when the order of operations and events is critical, or where duplicates can't be tolerated. FIFO queues also provide exactly-once processing but are limited to 300 transactions per second (TPS).
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
Switching an application to a FIFO queue does require some code changes, and requires that a new queue be created -- existing queues can't be changed over to FIFO.
You can receive duplicate messages, but only "on rare occasions". And so you should aim for idempotency.
An instance can receive duplicate messages only once the SQS visibility time out has expired. By default the visibility timeout is 30 seconds. So you have 30 seconds to make sure that your processing is done, else other instances may welcome new messages.
See AWS SQS Timeout for timeout details.