How to tell when Lambdas complete processing of all messages in SQS - amazon-web-services

Currently I have a process where a Lambda (A) is triggered and works out which customers need another Lambda (B) run for them (via a queue). For any run, Lambda A could place 3k to 4k messages on the SQS queue for Lambda B to pick up and process. As Lambda B communicates with an external API, its concurrency is set to 10 so as not to overload the API. The whole process completes in 35 to 45 minutes.
My problem is how to tell when all the processing is complete?

If you don't need timely information, you could check out the CloudWatch Metrics that SQS offers, e.g.:
ApproximateNumberOfMessagesVisible
The number of messages available for retrieval from the queue.
Reporting Criteria: A non-negative value is reported if the queue is active.
and
ApproximateNumberOfMessagesNotVisible
The number of messages that are in flight. Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.
Reporting Criteria: A non-negative value is reported if the queue is active.
If the sum of these two metrics hits zero, no messages are in the Queue, and processing should be done.
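As a rough sketch of that check: the same approximate counts are also exposed as queue attributes, so a poller can read them directly instead of going through CloudWatch. The function name `queue_is_drained` is made up for illustration, and `sqs_client` is assumed to be a boto3 SQS client (or anything with the same `get_queue_attributes` method). Because the counts are approximate, a single zero reading may be premature; it is probably safer to require a few consecutive zero readings.

```python
def queue_is_drained(sqs_client, queue_url):
    """Return True when the queue reports no visible, in-flight, or delayed messages."""
    names = [
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesNotVisible",
        "ApproximateNumberOfMessagesDelayed",
    ]
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=names
    )
    attrs = response.get("Attributes", {})
    # Attribute values come back as strings, so convert before summing.
    return sum(int(attrs.get(name, "0")) for name in names) == 0
```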
If you need more timely information, the producer of the messages could increment a counter item in DynamoDB with the number of messages added, and each Lambda decrements that counter once it's done. You could then add a Lambda to the DynamoDB Stream of that table with a filter and do something when the value changes to zero again. This is, however, much more complex.
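A minimal sketch of the counter idea, assuming a DynamoDB table keyed on a hypothetical `run_id` attribute with a numeric `remaining` attribute, and a boto3 `Table` resource passed in as `table`. The names `add_to_counter` and `on_message_done` are invented for this example. Note that standard queues deliver at least once, so a duplicate delivery would decrement twice unless the consumer is idempotent.

```python
def add_to_counter(table, run_id, delta):
    """Atomically add delta (positive or negative) to the run's counter.

    Returns the counter value after the update.
    """
    response = table.update_item(
        Key={"run_id": run_id},
        UpdateExpression="ADD #r :delta",
        ExpressionAttributeNames={"#r": "remaining"},
        ExpressionAttributeValues={":delta": delta},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["remaining"])


def on_message_done(table, run_id):
    """Called by Lambda B after finishing one message.

    Returns True when this was the last outstanding message of the run.
    """
    return add_to_counter(table, run_id, -1) == 0
```

Lambda A would call `add_to_counter(table, run_id, number_of_messages)` before sending, and Lambda B would call `on_message_done` at the end of each message.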
A third option could be to turn the whole thing into a Step Functions state machine and use a Map state with a concurrency limit to work on the tasks. The drawback is that, as far as I know, the length of the list it can work on is limited.
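For illustration only, a state machine definition along those lines might look like the sketch below, built as a Python dict so it can be serialized with `json.dumps`. The state names, the `$.customers` input path, and the function name `build_map_state_machine` are assumptions; `MaxConcurrency` is the Map state field that caps parallel iterations.

```python
import json


def build_map_state_machine(worker_lambda_arn, max_concurrency=10):
    """Build a definition that runs one task per customer, capped in parallel."""
    return {
        "StartAt": "ProcessCustomers",
        "States": {
            "ProcessCustomers": {
                "Type": "Map",
                "ItemsPath": "$.customers",  # hypothetical input field
                "MaxConcurrency": max_concurrency,
                "ItemProcessor": {
                    "StartAt": "CallLambdaB",
                    "States": {
                        "CallLambdaB": {
                            "Type": "Task",
                            "Resource": worker_lambda_arn,
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder ARN, for illustration only.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:LambdaB"
    print(json.dumps(build_map_state_machine(arn), indent=2))
```

The execution finishing is then the "all done" signal, with no counting needed.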

Related

SQS Lambda Trigger polling rate

I'm trying to understand how SQS Lambda Triggers works when polling for messages from the Queue.
Criteria
I'm trying to make sure that not more than 3 messages are processed within a period of 1 second.
Idea
My idea is to set the trigger BatchSize to 3 and set the ReceiveMessageWaitTimeSeconds of the queue to 1 second. Am I thinking about this correctly?
Edit:
I did some digging and it looks like I can set a concurrency limit on my Lambda. If I set the concurrency limit to one, only one batch of messages gets processed at a time. If my Lambda runs for a second, then the next batch of messages gets processed at least a second later. The gotcha here is that long polling automatically scales the number of asynchronous pollers on the queue based on message volume. This means the Lambdas can potentially throttle when a large number of messages come in. When the Lambdas throttle, the message goes back to the queue until it eventually goes into the DLQ.
ReceiveMessageWaitTimeSeconds is used for long polling. It is the length of time, in seconds, for which a ReceiveMessage action waits for messages to arrive (docs). Long polling does not mean that your client will wait for the full length of the time set. If you have it set to one second, but in the queue we already have enough messages, your client will consume them instantaneously and will try to consume again as soon as processing is completed.
If you want to consume a certain number of messages at a certain rate, you have to do this in your application (for example, by consuming messages on a scheduled basis). SQS by itself does not provide the kind of rate limiting you want to accomplish.
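One way to do that pacing in the application, sketched under the assumption that the consumer already holds a list of messages and a `handler` callable (both hypothetical here): process at most three messages, then pause for a second before the next group. Pausing after each group finishes is deliberately conservative, so no one-second window contains messages from two groups.

```python
import time


def process_in_paced_batches(messages, handler, batch_size=3,
                             pause_seconds=1.0, sleep=time.sleep):
    """Handle messages in groups of batch_size, pausing between groups.

    Returns the number of messages handled. The sleep function is injectable
    so the pacing can be tested without actually waiting.
    """
    processed = 0
    for start in range(0, len(messages), batch_size):
        if start > 0:
            sleep(pause_seconds)  # Pause between groups, not before the first.
        for message in messages[start:start + batch_size]:
            handler(message)
            processed += 1
    return processed
```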

How to space apart retries of consumer/lambda processing of SQS message to being 10 hours apart

I would like to understand the limits with respect to how long consumer message processing attempts can be spaced apart. For example, suppose I have the following AWS Resources
SQS Queue (named "SQSQueueName1") w/ redrive configured to send dead letter messages to SQSQueueName1DLQ
SQS Queue DLQ (named "SQSQueueName1DLQ")
Lambda Function (named "LambdaName1")
If SQSQueueName1 has a redrive policy with maxReceiveCount set to 10, how far apart are the consumer's attempts to process this message in this scenario? Is there any control I have over the duration of time between consumer attempts? For example, can I space them apart such that attempts happen 10 hours apart? Or is this control completely non-existent, such that all control is delegated to the negotiation between the Lambda pollers and SQS (using visibility timeout + redrive)?
Again, my goal is to see if it's technically possible to control the amount of time between invocations to a set amount of time, say 10 hours or 24 hours.
SQS queues have a VisibilityTimeout attribute that controls exactly what you want. It is set to a duration, with a maximum value of 12 hours. After a message is read by a consumer, it is invisible to every other consumer for the duration of the visibility timeout. So if you set it to 10 hours, your message will only be retried after 10 hours. A 24-hour gap is beyond that maximum, so it is not possible with this setting alone.
The Lambda trigger does not change this behavior. When you trigger a Lambda function from an SQS queue, Lambda long-polls the queue, in other words it asks for new messages constantly. However, regardless of how many requests Lambda makes to SQS, if the message is not visible, Lambda won't read it.
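A small sketch of setting that spacing at the queue level, assuming `sqs_client` is a boto3 SQS client; the helper names are invented for this example. The 12-hour ceiling is checked up front, so a 24-hour request fails fast instead of being rejected by SQS.

```python
MAX_VISIBILITY_TIMEOUT_SECONDS = 12 * 60 * 60  # SQS upper bound of 12 hours


def visibility_timeout_seconds(hours):
    """Convert hours to seconds, rejecting values outside the SQS range."""
    seconds = int(hours * 3600)
    if not 0 <= seconds <= MAX_VISIBILITY_TIMEOUT_SECONDS:
        raise ValueError(f"{hours} hours is outside the 0 to 12 hour range")
    return seconds


def set_retry_spacing(sqs_client, queue_url, hours):
    """Set the queue's default visibility timeout; returns the seconds used."""
    seconds = visibility_timeout_seconds(hours)
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        # Queue attribute values are passed as strings.
        Attributes={"VisibilityTimeout": str(seconds)},
    )
    return seconds
```

With maxReceiveCount at 10 and a 10-hour timeout, a message could stay in the queue for roughly 100 hours, so the queue's message retention period needs to be longer than that.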

Get SQS message from a priority queue

I have 3 SQS queues:
HighPQueue1
MediumPQueue2
LowPQueue3
Messages are inserted in the queue based on the API gateway REST API call. If the message is of high priority, it goes to HighPQueue1. If the message is medium, it goes to MediumPQueue2. If the message is low, it goes to LowPQueue3.
The messages from these 3 queues have to be read in priority order. How can I do that using AWS?
I have thought about creating a Lambda and then checking if message is available first in HighPQueue1, then in MediumPQueue2 and then in LowPQueue3. Would that be the right approach?
I have to trigger AWS step functions for each SQS message depending on the priority. I want to limit to 10 concurrent requests for my AWS step functions at any given point in time.
You won't be able to use the lambda integration for this, but you could still use lambda if you want to start a new invocation every so often. I think what you are suggesting for the pattern is correct (check high, then medium, then low). Here are some things to keep in mind.
Make sure when you are checking the medium and low queues that you only request one message at a time if it's important that the high queue messages are processed quickly.
If you process any message you start over. In other words don't make the mistake of processing a high item and then checking the medium queue. Always start over.
Lambda may not be your best option if you are polling queues. You'll effectively have lambda compute running all the time. That still may be okay if this is the only workload running and you are staying within, or close to within, the free tier.
Consider handling multiple requests at the same time. Is there something in your downstream infrastructure that limits you to processing one message at a time? If not, I would skip this model entirely and go with one queue backed by lambda and running processes in parallel when multiple come in.
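The "check high, then medium, then low, and always start over" pattern could be sketched like this. `make_receiver` wraps a boto3 SQS client and requests one message at a time; `next_message` and `drain_by_priority` are hypothetical helper names. Deleting a message after it is handled, and capping the number of concurrent Step Functions executions, are left to the `handle` callable in this sketch.

```python
def make_receiver(sqs_client):
    """Return a function that fetches at most one message from a queue."""
    def receive_one(queue_url):
        response = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1
        )
        messages = response.get("Messages", [])
        return messages[0] if messages else None
    return receive_one


def next_message(receive_one, queue_urls):
    """Return (queue_url, message) from the first non-empty queue, else None.

    queue_urls must be ordered from highest to lowest priority.
    """
    for queue_url in queue_urls:
        message = receive_one(queue_url)
        if message is not None:
            return queue_url, message
    return None


def drain_by_priority(receive_one, queue_urls, handle):
    """Handle messages until every queue is empty; returns the count handled.

    After each message the scan restarts at the highest-priority queue.
    """
    handled = 0
    while True:
        found = next_message(receive_one, queue_urls)
        if found is None:
            return handled
        handle(*found)
        handled += 1
```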

SQS and Lambda: Limit max. amount of processed messages

If using SQS as an event source for a Lambda function, is there a way to limit the maximum amount of "active" messages to x. So, imagine there's a SQS queue with 1000 messages but instead of trying to process as many messages as possible (up to the default concurrency limit of 1000) we only want to process up to x messages at the same time. This obviously means that it'll take more time to process all messages but it would give us a possibility to better control e.g. writes to a database.
Also, in case a message can't be processed (due to e.g. an error that occurred in the Lambda function), is the message appended to the end of the queue (so all other messages are coming first) or is there a way to prioritise them after a certain waiting time (visibility timeout)?
Many thanks
As for throttling a queue, you could have added a delivery delay or used long polling, but as yours is event driven this isn't an option. That leaves you with throttling your Lambda to the x concurrent executions you want.
As for the messages which can't be processed, that depends on whether you are using:
- a standard queue, which gives no guarantee about which message is picked up next.
- a .fifo queue, which will try to process it again, as it is next in line chronologically.
But if you catch the error, you should send the message straight to a dead letter queue to prevent unnecessary retries.
Although by throttling it you're giving up the scalability of AWS, which goes against its native architecture. I'd recommend going back to the database and seeing whether anything can be improved there instead, to avoid throttling.
From Reserving Concurrency for a Lambda Function - AWS Lambda:
You can configure a function with reserved concurrency to guarantee that it can always reach a certain level of concurrency. Reserving concurrency also limits the maximum concurrency for the function.
...
Your function can't scale out of control – Reserved concurrency also limits your function from using concurrency from the unreserved pool, capping its maximum concurrency. Reserve concurrency to prevent your function from using all the available concurrency in the region, or from overloading downstream resources.
If a message is not processed within the visibility timeout period, it becomes visible on the queue again. There is no guarantee of ordering of messages in Amazon SQS unless you are using a FIFO queue, which has further limitations on in-flight messages.
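Setting that cap programmatically might look like the sketch below, assuming `lambda_client` is a boto3 Lambda client; `cap_concurrency` is a made-up helper name. The same setting is available in the console and in infrastructure templates.

```python
def cap_concurrency(lambda_client, function_name, limit):
    """Reserve (and thereby cap) the function's concurrency at limit.

    Returns the reserved concurrency reported back by the service.
    """
    if limit < 0:
        raise ValueError("limit must not be negative")
    response = lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    return response["ReservedConcurrentExecutions"]
```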

Optimizing SQS Lambda configuration for single concurrency microservice composition

Apologies for the title. It's hard to summarise what I'm trying to accomplish.
Basically let me define a service to be an SQS Queue + A Lambda function.
A service (represented by square brackets below) performs a given task, where the queue is the input interface, processes the input, and outputs on to the queue of the subsequent service.
Service 1 Service 2 Service 3
[(APIG) -> (Lambda)] -> [(SQS) -> (Lambda)] -> [(SQS) -> (Lambda)] -> ...
Service 1: Consumes the request and payload, splits it into messages and passes on to the queue of the next service.
Service 2: This service does not have a reserved concurrency. It validates each message on the queue, and if valid, passes on to the next service.
Service 3: Processes each message in the queue (ideally in batches of approximately 100). The lambda here must have a reserved concurrency of 1 (as it hits an API that can't process multiple requests concurrently).
Currently I have the following configuration on Service 3.
Default visibility timeout of queue = 5 minutes
Lambda timeout = 5 minutes
Lambda reserved concurrency = 1
Problem 1: Service 3 consumes x items off the queue and if it finishes processing them within 30 seconds I expect the queue to process the next x items off the queue immediately (ideally x=100). Instead, it seems to always wait 5 minutes before taking the next batch of messages off the queue, even if the lambda completes in 30 seconds.
Problem 2: Service 3 typically consumes a few messages at a time (inconsistent) rather than batches of 100.
A couple of more notes:
In service 3 I do not explicitly delete messages off the queue using the lambda. AWS seems to do this itself when the lambda successfully finishes processing the messages
In service 2 I have one item per message. And so when I send messages to Service 3 I can only send 10 items at a time, which is kind of annoying. Because queue.send_messages(Entries=x), len(x) cannot exceed 10.
Does anyone know how I solve Problem 1 and 2? Is it an issue with my configuration? If you require any further information please ask in comments.
Thanks
Both your problems and your notes indicate a misconfigured SQS queue and/or Lambda function.
In service 3 I do not explicitly delete messages off the queue using
the lambda. AWS seems to do this itself when the lambda successfully
finishes processing the messages.
SQS itself has no way of knowing that your Lambda function processed a message successfully; that is exactly why the visibility timeout exists. A message leaves the queue only in a few ways. The first is a DeleteMessage API call that identifies the message by its ReceiptHandle; when the queue is configured as a Lambda event source, Lambda makes this call on your behalf after the function returns successfully, which is most likely what you are seeing. The second is a redrive policy: once a message has been received more than the maximum receive count (for example 1), SQS moves it to the dead letter queue instead of returning it to the source queue. The last is a low Message Retention Period (min 60 seconds), after which SQS drops the message.
Problem 1: Service 3 consumes x items off the queue and if it finishes
processing them within 30 seconds I expect the queue to process the
next x items off the queue immediately (ideally x=100). Instead, it
seems to always wait 5 minutes before taking the next batch of
messages off the queue, even if the lambda completes in 30 seconds.
This simply doesn't happen if everything is working as it should. If the Lambda function finishes in 30 seconds, there is concurrency available for the function, and there are messages in the queue, then it will start processing the next messages right away.
The most likely cause is that your Lambda is timing out, or that invocations are being throttled by the concurrency limit; in both cases the messages only reappear after the visibility timeout, which would explain those 5 minutes. Make sure that the function really finishes in 30 seconds; you can monitor this via CloudWatch. The fact that the message has been successfully processed doesn't necessarily mean that the function has returned. Also make sure that there are messages to be processed when the function ends.
Problem 2: Service 3 typically consumes a few messages at a time
(inconsistent) rather than batches of 100.
It cannot consume 100 messages in one receive, since a single ReceiveMessage call returns at most 10 messages (messages in the sense of SQS messages, not the actual data stored within a message, which can be anywhere up to 256 KB, possibly "more" using the extended SQS library or a similar custom solution). Moreover, there is no guarantee that the Lambda will receive 10 messages in each batch. It depends on the Receive Message Wait Time setting. If you are using short polling (a wait time of 0 seconds), only a subset of the servers storing the messages is polled, and a single message is stored on only a subset of those servers. If those two subsets do not overlap when the queue is polled, the message is not received in that batch. You can control this by increasing the polling interval, Receive Message Wait Time (max 20 seconds), but if there are still not enough messages in the queue when the timer finishes, the batch will be received with fewer messages, possibly zero.
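Since the limit applies to SQS messages rather than to the items inside them, one possible workaround is to pack many items into each message body on the sending side. The sketch below is an assumption about how Service 2 could do that, not something from the question: `pack_items` groups items into JSON bodies and `to_batches` splits the bodies into lists of at most 10 entries, the shape expected by `queue.send_messages(Entries=...)`. Message and batch size limits still apply and are not checked here.

```python
import json


def pack_items(items, items_per_message=100):
    """Group items into JSON message bodies of up to items_per_message each."""
    return [
        json.dumps(items[i:i + items_per_message])
        for i in range(0, len(items), items_per_message)
    ]


def to_batches(bodies, batch_limit=10):
    """Split message bodies into entry lists of at most batch_limit entries.

    Each entry gets an Id that is unique across all batches.
    """
    entries = [
        {"Id": str(index), "MessageBody": body}
        for index, body in enumerate(bodies)
    ]
    return [
        entries[i:i + batch_limit]
        for i in range(0, len(entries), batch_limit)
    ]
```

Each batch can then be sent with `queue.send_messages(Entries=batch)`, and Service 3 would call `json.loads` on each record's body to get its list of items back.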
And as was mentioned in the comments, using this strategy with concurrency set to a low number can lead to some problems. Another thing is that you need to ensure that the rate at which messages are produced is roughly consistent with the time it takes one instance of the Lambda function to process a message; otherwise you will end up with a constantly growing queue, possibly losing messages after they outlive the Message Retention Period.