SQS → Lambda Problem With maximumBatchingWindow

Our intention is to trigger a lambda when messages are received in an SQS queue.
We only want one invocation of the lambda to run at a time (maximum concurrency of one).
We would like the lambda to be triggered every time one of the following is true:
There are 10,000 messages in the queue
Five minutes have passed since the last invocation of the lambda
Our consumer lambda is dealing with an API that has limited API calls and strict concurrency limits. The above setup ensures we never encounter concurrency issues, and we can batch our calls together, ensuring we never consume too many API calls.
Here is our serverless.yml configuration
functions:
  sqs-consumer:
    name: sqs-consumer
    handler: handlers.consume_handler
    reservedConcurrency: 1 # maximum concurrency of 1
    events:
      - sqs:
          arn: !GetAtt
            - SqsQueue
            - Arn
          batchSize: 10000
          maximumBatchingWindow: 300
    timeout: 900
resources:
  Resources:
    SqsQueue:
      Type: 'AWS::SQS::Queue'
      Properties:
        QueueName: sqs-queue
        VisibilityTimeout: 5400 # 6x greater than the lambda timeout
The above does not give us the desired behavior. We are seeing our lambda triggered every 1 to 3 minutes (instead of every 5). It is indeed using batches, because we'll see multiple messages processed in a single invocation, but even with just one or two messages in the queue at a time it doesn't wait 5 minutes to trigger the lambda.
Our messages are extremely small, so it's not possible we're coming anywhere close to the 6 MB limit.
We would expect the lambda to be triggered only when either 10,000 messages have accumulated in the queue or five minutes have passed since the previous invocation. Instead we are seeing the lambda invoked anywhere between every 1 and 3 minutes, with a batch size that never even breaks 100, much less 10,000.
The largest batch size I’ve seen it invoke the lambda with so far has been 28, and sometimes with only one message in the queue it’ll invoke the function when it’s only been one minute since the previous invocation.
We would like to avoid using Kinesis, as the volume we’re dealing with truly doesn’t warrant it.
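For illustration, here is a simplified sketch of the shape of our consume_handler (the rate-limited API client is stubbed out and the chunk size of 100 is arbitrary; this is not the real implementation):

import json

def call_rate_limited_api(chunk):
    # Stub standing in for the real rate-limited API client.
    pass

def consume_handler(event, context):
    # Lambda delivers the polled SQS messages as a list under "Records".
    records = event.get("Records", [])
    payloads = [json.loads(record["body"]) for record in records]

    # Batch everything from this invocation into as few downstream calls as possible.
    for start in range(0, len(payloads), 100):
        call_rate_limited_api(payloads[start:start + 100])

    # Returning without raising lets Lambda delete the whole batch from the queue.
    return {"processed": len(records)}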

Reply from AWS Support:
As per the Case ID 10802672001, I understand that you have an SQS event source mapping on Lambda with a batch size of 500 and a batch window of 60 seconds. I further understand that you have observed that the lambda function invocation has fewer messages than 500 in a batch and does not wait for the configured batch window time while receiving the messages. You would like to know why lambda is being invoked prior to meeting any of the above configured conditions and seek our assistance in troubleshooting the same. Please correct me if I have misunderstood your query in any way.
Initially, I would like to thank you for sharing the detailed
correspondence along with the screenshot of the logs, it was indeed
very helpful in troubleshooting the issue.
Firstly, I used the internal tools to check the configuration of your lambda function "sd_dch_archivebatterydata" and observed that there is no throttling in the lambda function and no reserved concurrency configured. As you might already be aware, Lambda is meant to scale while polling from SQS queues, and thus it is recommended not to use reserved concurrency, as it goes against the design of the event source. On checking the log screenshot shared by you, I observed there were no errors.
Regarding your query, please allow me to answer as follows:
Please understand that batch size is the maximum number of messages that lambda will read from the queue in one batch for a single invocation. It should be considered the maximum number of messages (up to) that can be received in a single batch, not a fixed value that will be received every time in a single invocation.
-- Please see "When Lambda invokes the target function, the event can contain multiple items, up to a configurable maximum batch size" in
the official documentation here [1] for more information on the same.
I would also like to add that, according to the internal architecture of the SQS service, Lambda pollers poll the messages from the queue using "ReceiveMessage" API calls and invoke the Lambda function.
-- Please refer the documentation [2] which states the following "If the number of messages in the queue is small (fewer than 1,000), you
most likely get fewer messages than you requested per ReceiveMessage
call. If the number of messages in the queue is extremely small, you
might not receive any messages in a particular ReceiveMessage
response. If this happens, repeat the request".
-- Thus, we can see that the number of messages that can be obtained in a single lambda invocation with a certain batch size depends on the
number of messages in an SQS queue and the SQS service internal
implementation.
Also, the batch window is the maximum amount of time that the poller waits to gather messages from the queue before invoking the function. However, this applies when there are no messages in the queue. Thus, as soon as there is a message in the queue, the Lambda function will be invoked without any further delay, without waiting for the specified batch window time. You can refer to the "WaitTimeSeconds" parameter of the "ReceiveMessage" API.
-- The batch window just ensures that lambda starts polling after a certain time so that enough messages are present in the queue. However, there are other factors, like the size of messages, incoming volume, etc., that can affect this behavior.
Additionally, I would like to confirm that polling from SQS to Lambda uses the synchronous invocation type, which has an invocation payload size limit of 6 MB. Please refer to the following AWS documentation for more information on the same [3].
Having said that, I can confirm that this Lambda polling behaviour is by design and not a bug. Please rest assured that there are no issues with the Lambda and SQS services.
Our scenario is archiving to S3, and we want fewer, larger files. It looks like our options are potentially Kinesis, or running a custom receive application on something like ECS...
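For anyone weighing the "custom receive application" option, here is a minimal sketch of a scheduled drain job under assumed names (queue URL, bucket and the schedule are placeholders, not our actual setup):

import json
import time
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/sqs-queue"  # placeholder
BUCKET = "my-archive-bucket"  # placeholder

def drain_handler(event, context):
    # Run this on a schedule (e.g. every 5 minutes); it empties the queue and
    # writes one larger object to S3 instead of many tiny ones. The queue's
    # visibility timeout must be long enough to cover the whole run.
    bodies, entries = [], []
    while True:
        # ReceiveMessage returns at most 10 messages per call, often fewer.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            bodies.append(m["Body"])
            entries.append({"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]})

    if not bodies:
        return

    s3.put_object(
        Bucket=BUCKET,
        Key=f"archive/{int(time.time())}.json",
        Body=json.dumps(bodies),
    )

    # Delete only after the S3 write succeeds; DeleteMessageBatch takes at most 10 entries.
    for start in range(0, len(entries), 10):
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=entries[start:start + 10])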

Related

What's the difference between the SQS batch window duration and ReceiveMessage wait time seconds?

You can specify SQS as an event source for Lambda functions, with the option of defining a batch window duration.
You can also specify the WaitTimeSeconds for a ReceiveMessage call.
What are the key differences between these two settings?
What are the use cases?
They're fundamentally different.
The receive message wait time setting determines whether your application does long or short polling. You should (almost) always opt for long polling as it helps reduce your costs; the how is in the documentation.
It can be set in 2 different ways:
at the queue level, by setting the ReceiveMessageWaitTimeSeconds attribute
at the call level, by setting the WaitTimeSeconds parameter on individual ReceiveMessage calls
It determines how long your application will wait for a message to become available in the queue before returning an empty result.
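A minimal boto3 sketch of both options (the queue URL is a placeholder):

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Queue level: calls that don't pass WaitTimeSeconds long-poll for up to 20 seconds.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
)

# Call level: long-poll this one request for up to 10 seconds;
# it returns earlier as soon as a message becomes available.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=10,
)
print(resp.get("Messages", []))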
On the other hand, you can configure an SQS queue as an event source for Lambda functions by adding it as a trigger.
When creating an SQS trigger, you have 2 optional fields:
batch size (the number of messages in each batch to send to the function)
batch window (the maximum amount of time to gather SQS messages before invoking the function, in seconds)
The batch window field sets the MaximumBatchingWindowInSeconds attribute of the SQS event source mapping.
It's the maximum amount of time, in seconds, that the Lambda poller waits to gather the messages from the queue before invoking the function. The batch window just ensures that more messages have accumulated in the SQS queue before the Lambda function is invoked. This increases the efficiency and reduces the frequency of Lambda invocations, helping you reduce costs.
It's important to note that it's defined as the maximum as it's not guaranteed.
As per the docs, your Lambda function may be invoked as soon as any of the below are true:
the batch size limit has been reached
the batching window has expired
the batch reaches the payload limit of 6 MB
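As a rough illustration, this is how both fields map onto the event source mapping API via boto3 (queue ARN and function name are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# Placeholders: substitute your own queue ARN and function name.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:my-queue",
    FunctionName="my-function",
    BatchSize=10,                       # "batch size"; up to 10,000 when a batching window is set
    MaximumBatchingWindowInSeconds=10,  # "batch window"
)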
To conclude, both features are used to control how long something waits but the resulting behaviour differs.
In the first case, you're controlling how long the poller (your application) could wait before it detects a message in your SQS queue and then immediately returns. You could set this value to 10 seconds, but if a message is detected on the queue after 5 seconds, the call will return. You can change this value per ReceiveMessage call, or have a universal value set at the queue level. You can take advantage of long (or short) polling with or without Lambda functions, as it's available via the AWS API, console, CLI and any official SDK.
In the second case, you're controlling how long the poller (the inbuilt Lambda poller) could wait before actually invoking your Lambda to process the messages. You could set this value to 10 seconds and even if a message is detected on the queue after 5 seconds, it may still not invoke your Lambda. The actual behaviour as to when your function is invoked will differ based on batch size and payload limits. This value is naturally set at the Lambda level and not per message. This option is only available when using Lambda functions.
You can’t use both together as long/short polling is for a constantly running application or one-off calls. A Lambda function cannot poll SQS for more than 15 minutes and that is with a manual invocation.
For Lambda functions, you would use native SQS event sourcing and for any other service/application/use case, you would manually integrate SQS.
They're the same in the sense that both aim to help you ultimately reduce costs, but very different in terms of where you can use them.

SQS batching with Lambda event source mapping

I am trying to do some testing with SQS and Lambda and I have set the batch size to 10 and the batch window to 10 on the event source mapping between SQS and a function.
I sent 8 messages in an 8 second window and I expected to see a single function invocation after 10 seconds with all 8 messages in the event, but what I actually observed was 4 separate function invocations with differing numbers of messages in them.
Am I misunderstanding these configuration settings, or is there something on the queue settings which is causing this?
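For anyone who wants to reproduce a similar test, here is a minimal sketch (the queue URL is a placeholder; this is an approximation of the setup described above, not my exact script):

import json
import time
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/test-queue"  # placeholder

# Roughly mimic the test: 8 small messages spread over ~8 seconds, then check the
# function's CloudWatch logs to see how many invocations (and what batch sizes)
# they arrive in.
for i in range(8):
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"n": i}))
    time.sleep(1)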
I am posting this in case this catches anyone else out but I did reach out to AWS and received the following response, which I hope might prove useful to other people:
I understand that you would like to use an event source mapping to trigger Lambda functions from SQS. In testing, you have noticed some strange behavior, specifically; by setting the batch size to 10 and sending 8 messages to the queue, you expected to see a single lambda invocation containing the batch of 8 messages, but rather what you observed was 4 lambda invocations. Please feel free to correct me if I have misunderstood at any point.
To explain, in order for Lambda functions with an SQS queue configured as an event source to scale optimally, the following should be true:
- The function isn't producing any errors.
- There are sufficient messages in the SQS queue.
- There is sufficient unreserved concurrency in the AWS Region, or the reserved concurrency for the function is at least 1,000 for standard queues or equivalent to the number of active message groups or higher for FIFO queues.
When messages are available, Lambda reads up to five batches and sends them to your function [1]. When there are fewer items in the queue, the batch will be smaller than the maximum batch size (i.e. 10). Therefore, if your SQS queue has fewer than 1,000 messages, you're less likely to receive a full batch of 10 messages in your invocations.
From the above, you would only receive full batches if you have a large queue depth (i.e. many messages in the queue, based on the ApproximateNumberOfMessagesVisible metric). As you were testing with a few messages and Lambda reads up to five batches when messages are available, each batch will have fewer messages than the batch size until there are sufficient messages in the queue to read full batches.
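If you want to sanity-check queue depth while testing, the ApproximateNumberOfMessages queue attribute (which backs the ApproximateNumberOfMessagesVisible CloudWatch metric mentioned above) can be read directly; a minimal sketch with a placeholder queue URL:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/test-queue"  # placeholder

attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]

# Full batches are only likely with a large backlog (roughly 1,000+ visible messages).
print(attrs)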

SNS > AWS Lambda asynchronous invocation queue vs. SNS > SQS > Lambda

Background
This architecture relies solely on Lambda's asynchronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data that can vary drastically in size (tens of KB to potentially 1-3 MB). The data contains a JSON array containing one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subscribed to the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS-managed Lambda asynchronous queue begins to fill up as the number of parser instances maxes out at 3. The Lambda queueing mechanism initiates retries with incremental backoff when throttling occurs, until the invocation request can be processed by the parser function.
It is imperative that a record does not get lost during this process, as records cannot be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocations. Everything worked as expected. Lambda reported hundreds of throttle responses, but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaviour is effectively the same as what I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
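A minimal sketch of how the retry and failure-handling settings for the async path can be pinned down explicitly (function name and DLQ ARN are placeholders; this is an assumption about configuration, not my actual setup):

import boto3

lambda_client = boto3.client("lambda")

# Placeholders: substitute the parser function's name and the ARN of an SQS DLQ.
lambda_client.put_function_event_invoke_config(
    FunctionName="parser",
    MaximumRetryAttempts=2,          # retries after function errors (0-2)
    MaximumEventAgeInSeconds=21600,  # throttled events are retried with backoff up to this age (max 6 hours)
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:parser-dlq"}
    },
)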
Questions
1. Is the built-in AWS managed Lambda asynchronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prolonged periods, so I want to understand whether the Lambda queue can handle this as sensibly as SQS would. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a definitive answer, but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/#zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail; I can't find it at the moment. The premise is the same as this: queues are leaky buckets.
So what if I add more leaky buckets? Well, you've delayed the leaking; however, it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
An MCVE is missing from this question, so I cannot address the precise problem you are having.
As for an opinion on which to choose between SQS and the Lambda queue, I'll point to the Meta on this:
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR:
It depends
I think the biggest advantage of using your own queue is the fact that you, as a user, have visibility into the state of your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions with the 'guarantee' that Lambda will retry over an interval. If using an SQS source queue instead, you have complete visibility into the state of your message processing at all times, with no ambiguity.
Secondly, regarding overpulling: in theory this is a concern, but in practice it's never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously, set your retry policy appropriately and use a DLQ, as transient/unpredictable errors CAN occur.
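A minimal sketch of the retry policy / DLQ combination being recommended (queue URL and DLQ ARN are placeholders):

import json
import boto3

sqs = boto3.client("sqs")

source_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/parser-queue"  # placeholder
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:parser-dlq"  # placeholder (the DLQ must already exist)

# After maxReceiveCount failed receives, SQS moves the message to the DLQ
# instead of redelivering it indefinitely.
sqs.set_queue_attributes(
    QueueUrl=source_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)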

Processing AWS SQS messages with separate Lambda at a time

Like the title suggests, I have a scenario that I would like to explore but do not know how to go about it.
I have a lambda function processCSVFile. I also have an SQS queue that, at a set time every day, gets populated with links to CSV files in S3, let's say about 2000 messages. Now I want to process 25 messages at a time once the SQS queue has the messages.
The scenario I am looking for is to process 25 messages concurrently; I want the 25 messages to be processed by 25 separate lambda invocations. I thought I could use the SendMessageBatch function in SQS, but that only delivers messages to the queue, so it does not seem to apply to my use case.
My question is, am I able to perform the action explained above and if it is possible, what documentation or use cases can explain what I am looking for.
Also, if this use case is impossible, what do you recommend as an alternative way to do the processing I want done concurrently.
To process 25 messages from Amazon SQS with 25 concurrent Lambda functions (1 message per running Lambda function), you would need:
A maximum concurrency of 25 configured for the Lambda function (otherwise it might go higher than this when more messages are available)
A batch size of 1 configured on the Lambda trigger so that SQS only passes it one message at a time
See:
AWS Lambda Function Scaling (Maximum concurrency)
Configuring a Queue as an Event Source (Batch size)
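A minimal sketch of that combination with boto3 (the queue ARN is a placeholder; processCSVFile is the function named in the question):

import boto3

lambda_client = boto3.client("lambda")

FUNCTION_NAME = "processCSVFile"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:csv-links"  # placeholder

# Cap the function at 25 concurrent executions.
lambda_client.put_function_concurrency(
    FunctionName=FUNCTION_NAME,
    ReservedConcurrentExecutions=25,
)

# Deliver one message per invocation.
lambda_client.create_event_source_mapping(
    EventSourceArn=QUEUE_ARN,
    FunctionName=FUNCTION_NAME,
    BatchSize=1,
)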
I think that a combination of lambda's event source mapping for SQS and setting reserved concurrency to 25 could be the way to go.
The Lambda poller uses long polling to prepare message batches for concurrent processing by lambda. Thus each invocation of your function could get more than 1 message at a time.
I don't think there is a way to set the event source mapping to serve just one message per batch. If you absolutely must ensure only one message is processed by the lambda, then you can process one and disregard the others (put them back on the queue).
The reserved concurrency of 25 guarantees that you won't be running more than 25 functions in parallel. If you leave it at its default value, you can run up to whatever free concurrency you have in your account.
Edit:
@JohnRotenstein already confirmed that there is a way to set lambda to pass one message at a time to your function.
Hope this helps.

Tracking a message on multiple AWS SQS queues

I have around 3 AWS Lambda functions taking the following form:
Lambda function 1: Reads from an SQS queue and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 2: Reads the message from Lambda function 1, and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 3: Reads the message from Lambda function 2, and updates storage.
There are 3 queues involved and the message format (structure) in each queue is different; however, they share one uniqueId, which is the same in each and can be used to relate them to each other. So my question is: is there any way in SQS, or some other tool, to track the messages? What I'm specifically looking for is stuff like:
Time the message was entered into the queue
Time the message was taken by the Lambda function for processing
My problem is that the 3 Lambda functions individually complete within a couple of milliseconds, but the time taken for end-to-end execution is way too long; I suspect that the messages are spending too long in transit.
I'm open to any other ideas on optimisation.
AWS Step Functions is specifically designed for passing information between Lambda functions and orchestrating the whole process.
However, you would need to change the way you have written your functions to take advantage of Step Functions.
Since your only real desire is to explore why it takes "way too long", AWS X-Ray should be a good way to gather this information. It can trace a single transaction end-to-end through each process. I think it's just a matter of including a library and activating X-Ray.
See: AWS Lambda and AWS X-Ray - AWS X-Ray
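As a rough sketch, wiring the X-Ray SDK into a Python handler looks like this (the handler below is a placeholder; active tracing must also be enabled on the function, and aws_xray_sdk must be bundled with the deployment package):

from aws_xray_sdk.core import patch_all

# Instruments boto3/botocore (and other supported libraries) so calls such as
# SQS SendMessage show up as subsegments in the trace.
patch_all()

def handler(event, context):
    # Normal processing goes here; with active tracing enabled on the function,
    # X-Ray records the end-to-end trace for each invocation.
    return "ok"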
Or, just start doing some manual investigations in the log files. The log shows how long each function takes to run, so you should be able to identify whether the time taken is within a particular Lambda function, or whether the time is spent between functions, waiting for them to trigger.