I want to know when a batch of messages has completed in an AWS SQS Queue - amazon-web-services

I think this is more of an 'architecture design' question.
I have a producer lambda that puts ~600 messages on an SQS queue as a batch (600 separate messages, not 1 message with a body of ~600 items), and there are multiple producers. A consumer lambda takes individual messages and deals with them (at scale). What I want to do is run another lambda when each batch is complete.
My initial idea was to create a 'unique batch number', a 'total batch number' and a 'batch position number' and add them to the message attributes of every message, and then check these in the consumer lambda to decide whether the batch is complete.
But does that mean I would need to use a FIFO queue, partition on the batch number and have only one lambda consumer per batch? Or do I run some sort of state management in DynamoDB (is there a pattern out there for this? Please guide me on this).
Regards, J

It seems like the goal is to achieve Fork-Join capabilities in a distributed system. One way to handle this in AWS is using Step Functions. Assuming a queue service needs to be used, state of the overall operation will need to be tracked. Some ways to do this are:
Store state of the overall operation in a DB.
Put a 'termination' message in the queue after all others and process FIFO.
Create a metadata service which receives 'start' and 'stop' messages for each service and handles them accordingly.
Reference: Fork and Join with Amazon Lambda
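As a rough illustration of the "store state of the overall operation" option, here is a minimal sketch that uses the three attributes from the question (batch number, total, position). `BatchTracker` and `BatchState` are hypothetical names, and the in-memory dict is only a stand-in for a real state store such as a DynamoDB table.

```python
from dataclasses import dataclass, field


@dataclass
class BatchState:
    total: int                                # 'total batch number' from the message attributes
    seen: set = field(default_factory=set)    # 'batch position numbers' already processed


class BatchTracker:
    """In-memory stand-in for a state table (assumption: one record per batch)."""

    def __init__(self):
        self._batches = {}

    def record(self, batch_id: str, position: int, total: int) -> bool:
        """Mark one message as done; return True only when this call completes the batch."""
        state = self._batches.setdefault(batch_id, BatchState(total=total))
        if position in state.seen:
            return False  # duplicate delivery: this position was already counted
        state.seen.add(position)
        return len(state.seen) == state.total
```

Tracking positions rather than a bare count means a message delivered twice is not counted twice. In a real shared store the check-and-add would have to be a single atomic operation, since many consumers run at once.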

Related

How to tell when Lambdas complete processing of all messages in SQS

Currently I have a process where a Lambda (A) gets triggered which has logic to find out which customers another lambda (B) needs to run for (via a queue). For any run there could be 3k to 4k messages placed on the SQS Queue by Lambda A to be picked up by Lambda B to process. As Lambda B communicates with an external API, the concurrency is set to 10 for Lambda B so as not to overload the API. The whole process completes in 35 to 45 minutes.
My problem is how to tell when all the processing is complete?
If you don't need timely information, you could check out the CloudWatch Metrics that SQS offers, e.g.:
ApproximateNumberOfMessagesVisible
The number of messages available for retrieval from the queue.
Reporting Criteria: A non-negative value is reported if the queue is active.
and
ApproximateNumberOfMessagesNotVisible
The number of messages that are in flight. Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.
Reporting Criteria: A non-negative value is reported if the queue is active.
If the sum of these two metrics hits zero, no messages are in the Queue, and processing should be done.
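A minimal sketch of that check follows. Note that it swaps in the SQS queue attributes `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible` (read through `get_queue_attributes`) instead of the CloudWatch metrics, and it assumes `sqs_client` is a boto3 SQS client created by the caller. The function name is hypothetical.

```python
def queue_looks_empty(sqs_client, queue_url: str) -> bool:
    """Return True when no messages are visible or in flight (values are approximate)."""
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    visible = int(attrs["ApproximateNumberOfMessages"])        # attribute values arrive as strings
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])
    return visible + in_flight == 0
```

Because both numbers are approximate, it is probably safer to require several consecutive zero readings before treating the run as finished.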
If you need more timely information, the producer of the messages could increment a counter item in DynamoDB with the number of messages added, and each Lambda decrements that counter once it's done. You could then add a Lambda to the DynamoDB Stream of that table with a filter and do something when the value changes to zero again. This is, however, much more complex.
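The decrement side of that counter could look roughly like this. It assumes `table` is a boto3 DynamoDB `Table` resource, and the key name `batch_id` and attribute name `remaining` are hypothetical. For brevity the sketch checks the returned value directly instead of using a DynamoDB Stream with a filter.

```python
def mark_one_done(table, batch_id: str) -> bool:
    """Atomically decrement the counter; return True when it reaches zero."""
    response = table.update_item(
        Key={"batch_id": batch_id},                    # hypothetical key schema
        UpdateExpression="ADD #r :delta",
        ExpressionAttributeNames={"#r": "remaining"},  # hypothetical counter attribute
        ExpressionAttributeValues={":delta": -1},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["remaining"]) == 0
```

The `ADD` update is atomic per item, so concurrent consumers do not lose decrements. A message that is delivered twice would still be decremented twice, so the consumer needs its own duplicate handling.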
A third option could be to turn the whole thing into a Step Functions state machine and use a Map state with a concurrency limit (MaxConcurrency) to work on the tasks. The drawback is that the length of the list it can work on is limited, afaik.

AWS: concurrent queue system?

OK, so I have a few users wanting to perform a task at the same time.
The task is executed on a lambda and takes a few seconds.
I want to limit the number of parallel executions of the task to 3.
Meaning if 10 users call it at the same time, 3 tasks will run for 3 users, and the 7 remaining users will be put in a waiting queue.
When one user's task finishes, before exiting, the task looks in the waiting queue, takes another user from the queue, launches a new task for this user (fire and forget), then exits.
I need to be sure that there is no duplicate execution, so I'm storing the queue in RAM on a small server (in Lightsail), but I'm not sure that would scale well.
Can I have a similar workflow in a serverless way? I'm not sure SQS can solve my issue.
What you're asking for can be achieved with SQS and Lambda simply with this:
subscribe the Lambda function to the SQS queue with a batch size of 1, so a function gets executed as soon as a message is in the queue and only one message is handled per invocation
throttle that Lambda function to 3 concurrent invocations, so no more than 3 tasks are handled in parallel
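The two settings above could be applied with boto3 along these lines. This is only a sketch: `lambda_client` is assumed to be a boto3 Lambda client created by the caller, and the function name and its parameters are hypothetical.

```python
def limit_and_subscribe(lambda_client, function_name: str, queue_arn: str, max_parallel: int = 3):
    """Cap the function's concurrency, then subscribe it to the queue one message at a time."""
    # Reserved concurrency limits how many copies of the function run at once.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_parallel,
    )
    # BatchSize=1 means each invocation receives a single message.
    return lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=1,
    )
```

With this in place the waiting queue lives in SQS, so the task no longer has to look up and launch the next user itself.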

Kinesis-like sharding with SQS?

I'm interested in replacing Kinesis (because it's expensive and I don't need the historic log) with SQS, but I need a sharding/partitioning mechanism, specifically when processing with Lambda.
I see SQS FIFO queues have recently acquired Lambda event mapping -
https://aws.amazon.com/blogs/compute/new-for-aws-lambda-sqs-fifo-as-an-event-source/
which I think brings partitioning tantalisingly close via the use of MessageGroupID.
Message processing is described as proceeding via the following rules -
1) Return the oldest message where no other message with the same MessageGroupId is in flight.
2) Return as many messages with the same MessageGroupId as possible.
3) If a message batch is still not full, go back to the first rule. As a result, it’s possible for a single batch to contain messages from multiple MessageGroupIds.
1) and 2) sound great - each Lambda request batch containing a single MessageGroupID only - but then 3) seems to mess it all up :-(
Any thoughts on a strategy to ensure every Lambda request batch only contains messages from a single MessageGroupID ? Possibly via MessageDeduplicationID ?
Suspect the answer here is just to use one queue per “partition” as SQS pricing works on a per-message basis, not per-queue. If you have a lot of “partitions” then create them programmatically during the stack spin up process (e.g. as part of a CodeBuild script), rather than defining each and every queue in CloudFormation.
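If you go with one queue per "partition", the producer needs a stable way to pick the queue for a given key. A minimal sketch, assuming a fixed number of partitions and a hypothetical queue naming scheme (`work-partition-<n>`):

```python
import hashlib


def queue_name_for(partition_key: str, partition_count: int, prefix: str = "work-partition") -> str:
    """Map a partition key to one of `partition_count` queue names, deterministically."""
    # sha256 is used (rather than Python's built-in hash) because it is stable across processes.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % partition_count
    return f"{prefix}-{index}"
```

Each queue then gets its own Lambda event source mapping, so every invocation only ever sees messages from one partition.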

How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. We get a number of raw files which come into S3 buckets and then we apply a schema and then save them as parquet.
As of now we are triggering a lambda function for each file written, but ideally we would like to start this process only after all the files are written. How can we trigger the lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Run your lambda on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5min between being published and getting ingested by your system, set your lambda to trigger every 4min. You can use CloudWatch Events (now Amazon EventBridge) scheduled rules for this.
When your lambda runs, poll the queue. Keep going until you accumulate the maximum amount of notifications, X, you want to process in one go, or the queue is empty.
Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.
Things to keep in mind on the above:
As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.
SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
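The polling step from the list above might be sketched like this. It assumes `sqs_client` is a boto3 SQS client created by the caller; the function name and `max_total` (the "X" above) are hypothetical. Deleting messages after successful processing is left to the caller.

```python
def collect_notifications(sqs_client, queue_url: str, max_total: int) -> list:
    """Poll until `max_total` messages are collected or the queue appears empty."""
    collected = []
    seen_ids = set()
    while len(collected) < max_total:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=min(10, max_total - len(collected)),  # SQS returns at most 10 per call
            WaitTimeSeconds=1,  # brief long poll, so an empty reply is more trustworthy
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing came back: treat the queue as empty for this run
        for message in messages:
            if message["MessageId"] in seen_ids:
                continue  # same SQS message delivered again
            seen_ids.add(message["MessageId"])
            collected.append(message)
    return collected
```

The `MessageId` check only catches redelivery of the same SQS message. If the publisher emits two notifications for one file, they arrive as distinct messages, so the processing itself still has to tolerate duplicates.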
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all the files together and have the Lambda unzip them.
Write a UI (client) that sends the files one by one and triggers the Lambda when the last one is sent.
Have the Lambda check the files: if it doesn't find all of them, it quits silently; if it finds all of them, it handles them all in one thread.
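The "check, then quit silently" option could be sketched as below. It assumes the Lambda somehow knows the full list of expected object keys and has already listed the keys present in the bucket; both inputs and the function names are hypothetical.

```python
def missing_files(expected_keys, present_keys) -> list:
    """Return the expected keys that have not been written yet, sorted for stable output."""
    return sorted(set(expected_keys) - set(present_keys))


def should_process(expected_keys, present_keys) -> bool:
    """True only when every expected file is present; otherwise the handler should just return."""
    return not missing_files(expected_keys, present_keys)
```

With this check, every invocation except the one triggered by the last file returns immediately.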

How to ensure once-only processing of data in an AWS serverless architecture?

I have some data that needs to be processed at a point in time.
My current strategy is to pull the data every minute and load it into a queue and process it.
I have two concerns with this strategy:
I can't guarantee that the last minute captures all data so I pull the last two minutes; and
Lambdas as far as I know can fire multiple times depending on the trigger (in this case SQS.)
I'm trying to avoid writing a flag to the data because of the spikey nature of batch processing.
The only other solution I can think of is using S3 to create a lock-file.
Is there a better way to 'kick off' future events? Is there a strategy outside database and S3 flags?
Have a look at SQS FIFO queues; they are designed to deliver once and only once.
You can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once. ...source
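For the overlapping two-minute pulls in the question, one way to lean on FIFO deduplication is to derive the deduplication ID from the message content, so the same record sent twice within the 5-minute deduplication interval is accepted only once. A minimal sketch, assuming `sqs_client` is a boto3 SQS client and `queue_url` points at a FIFO queue; the function name is hypothetical.

```python
import hashlib


def send_once(sqs_client, queue_url: str, group_id: str, body: str):
    """Send to a FIFO queue with a content-derived deduplication ID."""
    # Identical bodies produce identical IDs, so SQS drops repeats inside the dedup interval.
    dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageGroupId=group_id,
        MessageDeduplicationId=dedup_id,
    )
```

This only covers duplicates on the sending side and within that interval; records pulled again later than that would still need a check in the consumer.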