Not all SQS messages end up in Lambda: most just disappear

I have an AWS SQS Queue (standard, non-FIFO) that has a Lambda function as a consumer.
Whenever I send a bunch of messages (usually around 10 at a time) to the queue, only about 2 get picked up by lambda (verified in CloudWatch Logs). The others disappear from the queue.
The Lambda batch size is set to 1, so I would expect all 10 messages to sit in the queue and get picked up by Lambda one by one, but that's not happening. I'm using CloudWatch to check what Lambda is doing, and there is no trace of the missing messages.
I verified in Lambda that it only gets one message every time, by logging the size of the event.Records array (which is always 1).
The Queue also has a Dead Letter Queue. Initially the Maximum Receives was set to 1. When I increased that to 3, more messages were getting picked up after the queue's Visibility timeout, but still only a few.
My Queue settings
Visibility timeout: 2 minutes
Delivery delay: 0 seconds
Receive message wait time: 5 seconds
Message retention period: 4 days
Maximum message size: 256 KB
I'm wondering why the messages aren't being processed, but instead disappear?

The typical reason why messages are lost is that the Lambda function triggered by Amazon SQS is not correctly processing all Records passed to the function.
Make sure the code loops through all Records passed in the event parameter, since multiple messages can be provided in each Lambda invocation.
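A minimal sketch of such a loop, assuming a Python handler. `process_message` is a hypothetical placeholder for the real work, and the `batchItemFailures` return value only has an effect if `ReportBatchItemFailures` is enabled on the event source mapping, which is an assumption about the setup and not something stated in the question.

```python
import json


def process_message(body):
    """Hypothetical business logic; raises on input it cannot handle."""
    payload = json.loads(body)
    if "id" not in payload:
        raise ValueError("message has no id")
    return payload["id"]


def handler(event, context):
    """Process every SQS record in the event, not just the first one."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_message(record["body"])
        except Exception as exc:
            # Log the failure and remember which message it was.
            print(f"failed {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    # Assumption: only honoured when ReportBatchItemFailures is enabled.
    return {"batchItemFailures": failures}
```

Logging each `messageId` also makes it easier to confirm in CloudWatch Logs which messages actually reached the function.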

Turns out this was related to the Reserved Concurrency of the Lambda function. My concurrency was set to 1, which caused issues.
My expectation
SQS messages will remain in the queue until there's a Lambda function available to pick them up.
In reality
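Messages that Lambda cannot process because the function is throttled become visible again after the Visibility Timeout and are treated as failed messages. Each such attempt still counts as a receive, so with a low Maximum Receives value they are moved to the Dead Letter Queue instead of being retried.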
There's an excellent blog post about this issue: https://data.solita.fi/lessons-learned-from-combining-sqs-and-lambda-in-a-data-project/
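The failure mode described above can be turned into a quick configuration check. This is an illustrative sketch only: `check_sqs_lambda_config` is a hypothetical helper, the first two thresholds follow the commonly cited AWS guidance (visibility timeout at least six times the function timeout, `maxReceiveCount` of at least 5), and the concurrency threshold is a rough heuristic. Verify all of them against the current Lambda documentation.

```python
def check_sqs_lambda_config(visibility_timeout_s, function_timeout_s,
                            max_receive_count, reserved_concurrency=None):
    """Return warnings for settings that tend to push throttled messages to a DLQ."""
    warnings = []
    if visibility_timeout_s < 6 * function_timeout_s:
        warnings.append("visibility timeout is under 6x the function timeout")
    if max_receive_count is not None and max_receive_count < 5:
        warnings.append("maxReceiveCount is under 5; throttled receives may exhaust it")
    # Heuristic assumption: very low reserved concurrency makes throttling likely.
    if reserved_concurrency is not None and reserved_concurrency < 5:
        warnings.append("low reserved concurrency; expect throttled invocations")
    return warnings


# Settings from the question, with an assumed (hypothetical) 30 s function timeout.
for warning in check_sqs_lambda_config(120, 30, max_receive_count=1,
                                       reserved_concurrency=1):
    print(warning)
```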

Related

SQS batching with Lambda event source mapping

I am trying to do some testing with SQS and Lambda and I have set the batch size to 10 and the batch window to 10 on the event source mapping between SQS and a function.
I sent 8 messages in an 8 second window and I expected to see a single function invocation after 10 seconds with all 8 messages in the event but actually what I observed was 4 separate function invocations with differing amounts in the message bodies.
Am I misunderstanding these configuration settings, or is there something on the queue settings which is causing this?
I am posting this in case this catches anyone else out but I did reach out to AWS and received the following response, which I hope might prove useful to other people:
I understand that you would like to use an event source mapping to trigger Lambda functions from SQS. In testing, you have noticed some strange behavior. Specifically, by setting the batch size to 10 and sending 8 messages to the queue, you expected to see a single Lambda invocation containing the batch of 8 messages, but what you observed was 4 Lambda invocations. Please feel free to correct me if I have misunderstood at any point.
To explain, in order for Lambda functions with an SQS queue configured as an event source to scale optimally, the following should be true:
- The function isn't producing any errors.
- There are sufficient messages in the SQS queue.
- There is sufficient unreserved concurrency in the AWS Region, or the reserved concurrency for the function is at least 1,000 for standard queues or equivalent to the number of active message groups or higher for FIFO queues.
When messages are available, Lambda reads up to five batches and sends them to your function [1]. When there are fewer items in the queue, the batch will be smaller than the maximum batch size (i.e., 10). Therefore, if your SQS queue has fewer than 1,000 messages, you're less likely to receive a full batch of 10 messages in your invocations.
From the above, you would only receive full batches if you have a large queue depth (i.e., many messages in the queue, based on the ApproximateNumberOfMessagesVisible metric). As you were testing with a few messages and Lambda reads up to five batches when messages are available, each batch will have fewer messages than the batch size until there are sufficient messages in the queue to read full batches.

SQS → Lambda Problem With maximumBatchingWindow

Our intention is to trigger a lambda when messages are received in an SQS queue.
We only want one invocation of the lambda to run at a time (maximum concurrency of one).
We would like for the lambda to be triggered every time one of the following is true:
There are 10,000 messages in the queue
Five minutes have passed since the last invocation of the lambda
Our consumer lambda is dealing with an API with limited API calls and strict concurrency limits. The above solution ensures we never encounter concurrency issues and we can batch our calls together, ensuring we never consume too many API calls.
Here is our serverless.yml configuration
functions:
  sqs-consumer:
    name: sqs-consumer
    handler: handlers.consume_handler
    reservedConcurrency: 1 # maximum concurrency of 1
    events:
      - sqs:
          arn: !GetAtt
            - SqsQueue
            - Arn
          batchSize: 10000
          maximumBatchingWindow: 300
    timeout: 900
resources:
  Resources:
    SqsQueue:
      Type: 'AWS::SQS::Queue'
      Properties:
        QueueName: sqs-queue
        VisibilityTimeout: 5400 # 6x greater than the lambda timeout
The above does not give us the desired behavior. We are seeing our lambda triggered every 1 to 3 minutes (instead of 5). It indeed is using batches because we’ll see multiple messages being processed in a single invocation, but with even just one or two messages in the queue at a time it doesn’t wait 5 minutes to trigger the lambda.
Our messages are extremely small, so it's not possible we're coming anywhere close to the 6 MB limit.
We would expect the only time the lambda is triggered to be when either 10,000 messages have accumulated in the queue or five minutes have passed since the previous invocation. Instead we are seeing the lambda invoked every 1 to 3 minutes with a batch size that never even breaks 100, much less 10,000.
The largest batch size I’ve seen it invoke the lambda with so far has been 28, and sometimes with only one message in the queue it’ll invoke the function when it’s only been one minute since the previous invocation.
We would like to avoid using Kinesis, as the volume we’re dealing with truly doesn’t warrant it.
Reply from AWS Support:
As per the Case ID 10802672001, I understand that you have an SQS
event source mapping on Lambda with a batch size of 500 and batch
Window of 60 seconds. I further understand that you have observed the
lambda function invocation has fewer messages than 500 in a batch and
is not waiting for batch window time configured while receiving the
messages. You would like to know why lambda is being invoked prior to
meeting any of the above configured conditions and seek our assistance
in troubleshooting the same. Please correct me if I misunderstood your
query by any means.
Initially, I would like to thank you for sharing the detailed
correspondence along with the screenshot of the logs, it was indeed
very helpful in troubleshooting the issue.
Firstly, I used the internal tools to check the configuration of your
lambda function "sd_dch_archivebatterydata" and observed that there
is no throttling in the lambda function and there is no reserved
concurrency configured. As you might already be aware that Lambda is
meant to scale while polling from SQS queues and thus it is
recommended not to use reserving concurrency, as it is going against
the design of the event source. On checking log screenshot shared by
you, I observed there were no errors.
Regarding your query, please allow me to answer them as follows:
Please understand here that Batch size is the maximum number of messages that lambda will read from the queue in one batch for a
single invocation. It should be considered as the maximum number of
messages (up to) that can be received in a single batch but not as a
fixed value that can be received at all times in a single invocation.
-- Please see "When Lambda invokes the target function, the event can contain multiple items, up to a configurable maximum batch size" in
the official documentation here [1] for more information on the same.
I would also like to add that, according to the internal architecture of how the SQS service is designed, Lambda pollers will
poll the messages from the queue using the "ReceiveMessage" API
calls and invokes the Lambda function.
-- Please refer to the documentation [2], which states the following: "If the number of messages in the queue is small (fewer than 1,000), you
most likely get fewer messages than you requested per ReceiveMessage
call. If the number of messages in the queue is extremely small, you
might not receive any messages in a particular ReceiveMessage
response. If this happens, repeat the request".
-- Thus, we can see that the number of messages that can be obtained in a single lambda invocation with a certain batch size depends on the
number of messages in an SQS queue and the SQS service internal
implementation.
Also, batch window is the maximum amount of time that the poller waits to gather the messages from the queue before invoking the
function. However, this applies when there are no messages in the
queue. Thus, as soon as there is a message in the queue, the Lambda
function will be invoked without further ado, without waiting for
the batch window time specified. You can refer to the
"WaitTimeSeconds" parameter in the "ReceiveMessage" API.
-- The batch window just ensures that lambda starts polling after certain time so that enough messages are present in the queue.
However, there are other factors like size of messages, incoming
volume, etc that can affect this behavior.
Additionally, I would like to confirm that polling from SQS in Lambda is of the synchronous invocation type, and it has an invocation payload
size limit of 6 MB. Please refer to the following AWS documentation for
more information on the same [3].
Having said that, I can confirm that this Lambda polling behaviour is
by design and not a bug. Please be rest assured that there are no
issues with the lambda and SQS service.
Our scenario is to archive to S3, and we want fewer, larger files. It looks like our options are potentially Kinesis, or running a custom receive application on something like ECS...
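For the custom receiver option, the "N messages or T seconds, whichever comes first" rule can be sketched as below. This is a hedged illustration and not a tested design: `collect_batch`, `receive` and `clock` are hypothetical names, and the SQS call is injected so the logic stays independent of any client library.

```python
import time


def collect_batch(receive, max_messages=10000, max_seconds=300,
                  clock=time.monotonic):
    """Accumulate messages until max_messages are held or max_seconds elapse.

    `receive(n)` is any callable returning a list of at most n messages.
    """
    deadline = clock() + max_seconds
    batch = []
    while len(batch) < max_messages and clock() < deadline:
        # SQS returns at most 10 messages per ReceiveMessage call.
        want = min(10, max_messages - len(batch))
        batch.extend(receive(want))
    return batch
```

With boto3, `receive` could be something like `lambda n: sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=n, WaitTimeSeconds=20).get("Messages", [])`, where `sqs` and `queue_url` are assumed to exist. The last long poll can overshoot the deadline by up to the wait time, the queue's visibility timeout has to cover the whole collection window, and the messages still need to be deleted once the file has been written.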

Delay in getting messages from AWS SQS

I am adding messages in SQS on Lambda and then receiving the messages inside a container on ECS.
The problem is that there is a delay of 10-15 seconds when I receive the messages in the container.
On the container a loop is running indefinitely every 1 second where I am getting messages and if available processing it.
Example:
Suppose the message is added in SQS at 15:20:00 but I am able to get that message at 15:20:15 on ECS. These 15 seconds are too long for my use case.
Can this time be reduced?
Assuming that there are multiple producers and consumers, is there any alternative solution?
If your workers are continually polling the Amazon SQS queue, they can reduce the number of requests by specifying WaitTimeSeconds=20 (which is its maximum value).
This tells Amazon SQS to wait until at least one message is available, to a maximum of 20 seconds. If no messages are available after 20 seconds, an empty set of messages is returned. However, if one or more messages appear in the queue, then the call returns immediately without waiting for 20 seconds.
This reduces the frequency of calls to SQS and might increase stability in your application.
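A rough sketch of a long-polling worker loop, assuming a boto3-style client is passed in. `poll_queue`, `handle` and `max_polls` are hypothetical names used for illustration; `receive_message` and `delete_message` are the standard SQS client calls.

```python
def poll_queue(sqs_client, queue_url, handle, max_polls=None):
    """Long-poll an SQS queue and delete each message after it is handled."""
    polls = 0
    handled = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling: returns early once a message arrives
        )
        # The "Messages" key is absent when the poll comes back empty.
        for message in response.get("Messages", []):
            handle(message["Body"])
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            handled += 1
    return handled
```

In the container this might be started as `poll_queue(boto3.client("sqs"), queue_url, process)`, with `queue_url` and `process` supplied by the application. Because the call blocks until a message is available, no separate one-second sleep is needed between iterations.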

How to space apart retries of consumer/lambda processing of SQS message to being 10 hours apart

I would like to understand the limits with respect to how long consumer message processing attempts can be spaced apart. For example, suppose I have the following AWS Resources
SQS Queue (named "SQSQueueName1") w/ redrive configured to send dead letter messages to SQSQueueName1DLQ
SQS Queue DLQ (named "SQSQueueName1DLQ")
Lambda Function (named "LambdaName1")
If SQSQueueName1 has a redrive policy with maxReceiveCount set to 10, how far apart are the consumer's attempts to process a message in this scenario? Is there any control I have over the duration of time between consumer attempts? For example, can I space them so that attempts happen 10 hours apart? Or is this control completely non-existent, such that everything is delegated to the negotiation between the Lambda pollers and SQS (using visibility timeout + redrive)?
Again, my goal is to see if it's technically possible to set the amount of time between invocations to a fixed duration, say 10 hours or 24 hours.
SQS queues have a visibility timeout (the VisibilityTimeout queue attribute) that controls exactly what you want. It is set to a duration, with a maximum value of 12 hours. After a message is read by a consumer, the message is invisible to everyone else for the duration of the visibility timeout. So if you set it to 10 hours, your message will only be retried after 10 hours. A gap of 24 hours cannot be achieved with this setting alone.
Lambda triggers are not related to this parameter at all. When you trigger a Lambda function from an SQS queue, Lambda long-polls the queue; in other words, it asks for new messages constantly. However, regardless of how many requests Lambda makes to SQS, if the message is not visible, Lambda won't read it.
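One thing worth checking with long gaps is whether all attempts fit inside the queue's message retention period. The sketch below is a back-of-the-envelope calculation, not an exact model: `retry_plan` is a hypothetical helper, it assumes each failed attempt is retried as soon as the visibility timeout expires, and it uses the default retention of 4 days (96 hours) unless told otherwise.

```python
MAX_VISIBILITY_TIMEOUT_H = 12  # SQS upper limit for the visibility timeout


def retry_plan(max_receive_count, visibility_timeout_h, retention_h=96):
    """Return the hour offset of each attempt and whether all fit in retention."""
    if not 0 <= visibility_timeout_h <= MAX_VISIBILITY_TIMEOUT_H:
        raise ValueError("visibility timeout must be between 0 and 12 hours")
    attempts = [n * visibility_timeout_h for n in range(max_receive_count)]
    fits = attempts[-1] < retention_h if attempts else True
    return attempts, fits


attempts, fits = retry_plan(max_receive_count=10, visibility_timeout_h=10)
print(attempts[-1], fits)
```

With 10 receives spaced 10 hours apart, the last attempt lands about 90 hours after the first, which only just fits in the default retention period. Raising the retention period (up to 14 days) leaves more headroom.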

Optimizing SQS Lambda configuration for single concurrency microservice composition

Apologies for the title. It's hard to summarise what I'm trying to accomplish.
Basically let me define a service to be an SQS Queue + A Lambda function.
A service (represented by square brackets below) performs a given task, where the queue is the input interface, processes the input, and outputs on to the queue of the subsequent service.
     Service 1               Service 2              Service 3
[(APIG) -> (Lambda)] -> [(SQS) -> (Lambda)] -> [(SQS) -> (Lambda)] -> ...
Service 1: Consumes the request and payload, splits it into messages and passes on to the queue of the next service.
Service 2: This service does not have a reserved concurrency. It validates each message on the queue, and if valid, passes on to the next service.
Service 3: Processes each message in the queue (ideally in batches of approximately 100). The lambda here must have a reserved concurrency of 1 (as it hits an API that can't process multiple requests concurrently).
Currently I have the following configuration on Service 3.
Default visibility timeout of queue = 5 minutes
Lambda timeout = 5 minutes
Lambda reserved concurrency = 1
Problem 1: Service 3 consumes x items off the queue and if it finishes processing them within 30 seconds I expect the queue to process the next x items off the queue immediately (ideally x=100). Instead, it seems to always wait 5 minutes before taking the next batch of messages off the queue, even if the lambda completes in 30 seconds.
Problem 2: Service 3 typically consumes a few messages at a time (inconsistent) rather than batches of 100.
A couple of more notes:
In service 3 I do not explicitly delete messages off the queue using the lambda. AWS seems to do this itself when the lambda successfully finishes processing the messages
In service 2 I have one item per message. And so when I send messages to Service 3 I can only send 10 items at a time, which is kind of annoying. Because queue.send_messages(Entries=x), len(x) cannot exceed 10.
Does anyone know how I solve Problem 1 and 2? Is it an issue with my configuration? If you require any further information please ask in comments.
Thanks
Both your problems and notes indicate misconfigured SQS and/or Lambda function.
In service 3 I do not explicitly delete messages off the queue using
the lambda. AWS seems to do this itself when the lambda successfully
finishes processing the messages.
This is definitely not the case here, as it would go against the reliability of SQS. How would SQS know that the message was successfully processed by your Lambda function? SQS doesn't care about consumers and doesn't really communicate with them, which is exactly why there is such a thing as a visibility timeout. SQS deletes a message in two cases: either it receives a DeleteMessage API call specifying the message to delete via its ReceiptHandle, or you have set up a redrive policy with the maximum receive count set to 1. In the latter case, SQS automatically sends a message to the dead letter queue if it is received more than once, which means that every message returned to the queue is sent there instead of staying in the queue. The last thing that can cause this is a low Message Retention Period (min 60 seconds), which drops the message after x seconds.
Problem 1: Service 3 consumes x items off the queue and if it finishes
processing them within 30 seconds I expect the queue to process the
next x items off the queue immediately (ideally x=100). Instead, it
seems to always wait 5 minutes before taking the next batch of
messages off the queue, even if the lambda completes in 30 seconds.
This simply doesn't happen if everything is working as it should. If the lambda function finishes in 30 seconds, if there is reserved concurrency for the function and if there are messages in the queue then it will start processing the message right away.
The only thing that could cause this is that your lambda (together with the concurrency limit) is timing out, which would explain those 5 minutes. Make sure that it really finishes in 30 seconds; you can monitor this via CloudWatch. The fact that the message has been successfully processed doesn't necessarily mean that the function has returned. Also make sure that there are messages to be processed when the function ends.
Problem 2: Service 3 typically consumes a few messages at a time
(inconsistent) rather than batches of 100.
It can never consume 100 messages, since the limit is 10 (messages in the sense of SQS messages, not the actual data stored within a message, which can be anywhere up to 256 KB, possibly "more" using the extended SQS library or a similar custom solution). Moreover, there is no guarantee that the Lambda will receive 10 messages in each batch. It depends on the Receive Message Wait Time setting. If you are using short polling (1 second), then only a subset of the servers storing the messages is polled, and a single message is stored on only a subset of those servers. If those two subsets do not match when the queue is polled, the message is not received in that batch. You can control this by increasing the polling interval, Receive Message Wait Time (max 20 seconds), but if there are not enough messages in the queue when the timer finishes, the batch will still be received with fewer messages, possibly zero.
And as was mentioned in the comments, using this strategy with concurrency set to a low number can lead to some problems. Another thing is that you need to ensure that the rate at which messages are produced is consistent with the time it takes one instance of the lambda function to process a message; otherwise you will end up with a constantly growing queue, possibly losing messages after they outlive the Message Retention Period.
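On the note in the question about `queue.send_messages(Entries=x)` accepting at most 10 entries, one way to live with that limit is to chunk the items before sending. A small sketch, assuming a boto3 `Queue` resource (or anything with a compatible `send_messages` method) is passed in; `chunked` and `send_all` are hypothetical helper names.

```python
def chunked(items, size=10):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_all(queue, bodies):
    """Send every body via send_messages, ten entries per request."""
    failed = []
    for offset, chunk in enumerate(chunked(bodies)):
        entries = [
            # Ids only need to be unique within a single request.
            {"Id": str(offset * 10 + i), "MessageBody": body}
            for i, body in enumerate(chunk)
        ]
        response = queue.send_messages(Entries=entries)
        # Entries that SQS rejected are reported under "Failed".
        failed.extend(response.get("Failed", []))
    return failed
```

Another option hinted at in the answer is to pack several items into one message body, as long as the body stays under the message size limit.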