AWS Lambda depleting the SQS queue very slowly - amazon-web-services

I have an SQS queue and a Lambda function that consumes the queue with a batch size of 10.
Lambda
Reserved concurrency = 600
Timeout = 15 minutes
Memory = 640 MB (but using 150-200 MB per execution)
Processing one item from the queue takes about 10 seconds.
SQS
Messages Available (Visible): 5,310
Messages in Flight (Not Visible): 3,355
Default Visibility Timeout: 20 minutes
With these settings, I'm expecting my Lambda function to run with 600 concurrent invocations, because as you can see the queue is full and there are items to be received from it. So the function shouldn't be idle and should use all of the available concurrency.
I'm aware of the burst during the first minute, and that afterwards my concurrency will increase every minute until it hits the limit. But my concurrent invocation count is always between 40 and 80. It never hits 600, and my queue is depleted very slowly. And (according to the logs) almost none of the queue items are failing, so they are not going back to the queue.
What is wrong with my settings?
EDIT:
Also another chart: the invocation count went up for a moment and then decreased again.

Related

AWS lambda throttling retries

I have a question about Lambda's asynchronous invocation: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
If the function doesn't have enough concurrency available to process
all events, additional requests are throttled. For throttling errors
(429) and system errors (500-series), Lambda returns the event to the
queue and attempts to run the function again for up to 6 hours. The
retry interval increases exponentially from 1 second after the first
attempt to a maximum of 5 minutes. If the queue contains many entries,
Lambda increases the retry interval and reduces the rate at which it
reads events from the queue.
From the doc, it seems like if I set a reserved concurrency for a Lambda function and it couldn't process events due to throttling, the event could be retried for up to 6 hours. However, it doesn't say anything about the total number of retries. How will this differ from the scenario where the Lambda function returns an error (it will be retried a maximum of 2 times)?
If the queue contains many entries, Lambda increases the retry interval and reduces the rate at which it reads events from the queue.
It seems that Lambda retries only if there is enough concurrency available for it. If not, it will wait for up to 6 hours.

Delay in getting messages from AWS SQS

I am adding messages in SQS on Lambda and then receiving the messages inside a container on ECS.
The problem is there is a 10-15 seconds of delay when I am receiving the messages on the container.
On the container, a loop runs indefinitely every 1 second, fetching messages and processing any that are available.
Example:
Suppose the message is added in SQS at 15:20:00 but I am able to get that message at 15:20:15 on ECS. These 15 seconds are too long for my use case.
Can this time be reduced ?
Assuming that there are multiple producers and consumers is there any alternative solution ?
If your workers are continually polling the Amazon SQS queue, they can reduce the number of requests by specifying WaitTimeSeconds=20 (which is its maximum value).
This tells Amazon SQS to wait until at least one message is available, to a maximum of 20 seconds. If no messages are available after 20 seconds, an empty set of messages is returned. However, if one or more messages appear in the queue, then the call returns immediately without waiting for 20 seconds.
This reduces the frequency of calls to SQS and might increase stability in your application.
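As a sketch, a long-polling receive loop with boto3 might look like the following. The queue URL and the surrounding usage are placeholders, not details from the question:

```python
def poll_once(sqs, queue_url):
    """One long poll: SQS returns as soon as at least one message
    arrives, or after 20 seconds with an empty list."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling, the maximum allowed value
    )
    return resp.get("Messages", [])

# Usage sketch (assumes boto3 and a valid queue URL):
# import boto3
# sqs = boto3.client("sqs")
# while True:
#     for msg in poll_once(sqs, "https://sqs.us-east-1.amazonaws.com/.../my-queue"):
#         print(msg["Body"])
```

Because the call returns early when a message arrives, this keeps the worst-case latency close to the one-second loop but with far fewer API requests.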

SQS batching for Lambda trigger doesn't work as expected

I have 2 Lambda functions and an SQS queue in between.
The first Lambda sends the messages to the queue.
The second Lambda has a trigger for this queue with a batch size of 250 and a batch window of 65 seconds.
I expect the second Lambda to be triggered in batches of 250 messages after about every 65 seconds. In the second Lambda I'm calling a 3rd party API that is limited to 250 API calls per minute (I get 250 tokens per minute).
I tested this setup with 32,000 messages being added to the queue, and the second Lambda didn't pick up the messages in batches as expected. At first it got executed for 15k messages, and then there were not enough tokens, so it did not process the rest.
The 3rd-party API is based on a token bucket with a fill rate of 250 per minute and a maximum capacity of 15,000. It managed to process the first 15,000 messages thanks to the bucket capacity and then didn't have enough tokens to handle the rest.
I don't understand what went wrong.
The misunderstanding is probably related to how Lambda handles scaling.
Whenever there are more events than a single Lambda execution context/instance can handle, Lambda just creates more execution contexts/instances to process these events.
What probably happened is that Lambda saw there were a bunch of messages in the queue and tried to work through them as fast as possible. It created a Lambda instance to handle the first event, then talked to SQS and asked for more work. When it got the next batch of messages, the first instance was still busy, so it scaled out and created a second one that worked on the second batch in parallel, and so on.
That's how you ended up going through your token budget in a few minutes.
You can limit how many functions Lambda is allowed to execute in parallel by using reserved concurrency - here are the docs for reference. If you set the reserved concurrency to 1, there will be no parallelization and only one Lambda is allowed to work on the messages.
This however opens you up to another issue. If that single Lambda takes less than 60 seconds to process the messages, Lambda will call it again with another batch ASAP and you might go over your budget again.
At this point a relatively simple approach would be to make sure that your lambda function always takes about 60 seconds by adding a sleep for the remaining time at the end.
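A minimal sketch of that padding approach, assuming a standard Python Lambda handler for an SQS trigger (`process` is a placeholder for the 3rd-party API call, not something from the question):

```python
import time

WINDOW_SECONDS = 60  # the token bucket refills every minute


def remaining_sleep(started_at, now, window=WINDOW_SECONDS):
    """Seconds left until the current one-minute window ends (never negative)."""
    return max(0.0, window - (now - started_at))


def process(body):
    pass  # placeholder for the rate-limited 3rd-party API call


def handler(event, context):
    started_at = time.time()
    for record in event["Records"]:
        process(record["body"])
    # Pad the invocation to ~60 seconds so the next batch doesn't
    # start before the token bucket has refilled.
    time.sleep(remaining_sleep(started_at, time.time()))
```

With reserved concurrency set to 1, this keeps each invocation occupying the single slot for a full minute, so at most one batch per minute is processed.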

Optimizing SQS Lambda configuration for single concurrency microservice composition

Apologies for the title. It's hard to summarise what I'm trying to accomplish.
Basically let me define a service to be an SQS Queue + A Lambda function.
A service (represented by square brackets below) performs a given task, where the queue is the input interface, processes the input, and outputs on to the queue of the subsequent service.
Service 1 Service 2 Service 3
[(APIG) -> (Lambda)] -> [(SQS) -> (Lambda)] -> [(SQS) -> (Lambda)] -> ...
Service 1: Consumes the request and payload, splits it into messages and passes on to the queue of the next service.
Service 2: This service does not have a reserved concurrency. It validates each message on the queue, and if valid, passes on to the next service.
Service 3: Processes each message in the queue (ideally in batches of approximately 100). The lambda here must have a reserved concurrency of 1 (as it hits an API that can't process multiple requests concurrently).
Currently I have the following configuration on Service 3.
Default visibility timeout of queue = 5 minutes
Lambda timeout = 5 minutes
Lambda reserved concurrency = 1
Problem 1: Service 3 consumes x items off the queue and if it finishes processing them within 30 seconds I expect the queue to process the next x items off the queue immediately (ideally x=100). Instead, it seems to always wait 5 minutes before taking the next batch of messages off the queue, even if the lambda completes in 30 seconds.
Problem 2: Service 3 typically consumes a few messages at a time (inconsistent) rather than batches of 100.
A couple of more notes:
In service 3 I do not explicitly delete messages off the queue using the Lambda. AWS seems to do this itself when the Lambda successfully finishes processing the messages.
In service 2 I have one item per message, so when I send messages to Service 3 I can only send 10 items at a time, which is kind of annoying: with queue.send_messages(Entries=x), len(x) cannot exceed 10.
Does anyone know how I solve Problem 1 and 2? Is it an issue with my configuration? If you require any further information please ask in comments.
Thanks
Both your problems and notes indicate a misconfigured SQS queue and/or Lambda function.
In service 3 I do not explicitly delete messages off the queue using
the lambda. AWS seems to do this itself when the lambda successfully
finishes processing the messages.
This is definitely not the case here, as it would go against the reliability of SQS. How would SQS know that the message was successfully processed by your Lambda function? SQS doesn't care about consumers and doesn't really communicate with them, which is exactly why there is such a thing as a visibility timeout. SQS deletes a message in two cases: either it receives a DeleteMessage API call specifying which message to delete via its ReceiptHandle, or you have set up a redrive policy with the maximum receive count set to 1. In that case, SQS automatically sends the message to a dead-letter queue if it is received more than once, which means that every message that is returned to the queue will be sent there instead of staying in the queue. The last thing that can cause this is a low Message Retention Period (minimum 60 seconds), which will drop the message after that many seconds.
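For reference, an explicit receive-process-delete cycle using the DeleteMessage API might be sketched like this. The queue URL and processing callback are placeholders, and `sqs` would normally be a `boto3.client("sqs")`:

```python
def receive_process_delete(sqs, queue_url, handle):
    """Receive up to 10 messages, process each one, and delete it
    via its ReceiptHandle only after successful processing."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        handle(msg["Body"])  # if this raises, the message is NOT deleted
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting only after `handle` succeeds is what makes the visibility timeout useful: a message whose processing failed simply reappears in the queue once the timeout expires.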
Problem 1: Service 3 consumes x items off the queue and if it finishes
processing them within 30 seconds I expect the queue to process the
next x items off the queue immediately (ideally x=100). Instead, it
seems to always wait 5 minutes before taking the next batch of
messages off the queue, even if the lambda completes in 30 seconds.
This simply doesn't happen if everything is working as it should. If the Lambda function finishes in 30 seconds, if there is reserved concurrency for the function, and if there are messages in the queue, then it will start processing the next messages right away.
The only thing that could cause this is that your Lambda (together with the concurrency limit) is timing out, which would explain those 5 minutes. Make sure that it really finishes in 30 seconds; you can monitor this via CloudWatch. The fact that the messages have been successfully processed doesn't necessarily mean that the function has returned. Also make sure that there are messages to be processed when the function ends.
Problem 2: Service 3 typically consumes a few messages at a time
(inconsistent) rather than batches of 100.
It can never consume 100 messages, since the limit is 10 (messages in the SQS sense, not the actual data stored within a message, which can be up to 256 KB, possibly "more" using the extended SQS client library or a similar custom solution). Moreover, there is no guarantee that the Lambda will receive 10 messages in each batch. It depends on the Receive Message Wait Time setting. If you are using short polling (1 second), only a subset of the servers storing the messages is polled, and a single message is stored only on a subset of those servers. If those two subsets do not match when the messages are polled, the message is not received in that batch. You can control this by increasing the polling interval, Receive Message Wait Time (max 20 seconds), but even if there are not enough messages in the queue when the timer finishes, the batch will still be delivered with fewer messages, possibly zero.
And as was mentioned in the comments, using this strategy with concurrency set to a low number can lead to some problems. Another thing is that you need to ensure that the rate at which messages are produced is roughly consistent with the time it takes one instance of the Lambda function to process a message; otherwise you will end up with a constantly growing queue, possibly losing messages after they outlive the Message Retention Period.
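On the note about the 10-entry limit of queue.send_messages: a small helper that splits a larger list into batches of at most 10 might look like this (`queue` stands for a hypothetical boto3 SQS Queue resource, not one from the question):

```python
def chunked(entries, size=10):
    """Yield successive slices of at most `size` entries
    (SendMessageBatch accepts at most 10 entries per call)."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]

# Usage sketch:
# for batch in chunked(all_entries):
#     queue.send_messages(Entries=batch)
```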

Efficient method to read messages from SQS without polling consecutively

I am very new to AWS SQS queues and I am currently playing around with python and boto.
Now I am able to read messages from SQS by polling consecutively.
The script is as follows:
while 1:
    m = q.read(wait_time_seconds=10)
    if m:
        print m
How do I make this script constantly listen for new additions to the queue without using while loop?
Is there a way to write a Python consumer for SQS that doesn't have to poll periodically for new messages?
Not really... that's how SQS works. If a message arrives during the wait, it will be returned almost immediately.
This is not the inefficient operation that it seems like.
If you increase your timeout to the maximum allowed 20 seconds, then, worst case, you will generate no more than about 3 polls per minute, or 3 × 60 × 24 × 30 = 129,600 "empty" polls per month... × $0.00000050 per poll = $0.0648. (The first 1,000,000 requests are billed at $0.)
Note that during the timeout, if a new message arrives, it will return almost immediately, not wait the full 20 sec.