I have 2 FIFO SQS queues which receives JSON messages that are to be indexed to elasticsearch. One queue is constantly adding delta changes to the database and adding them to the queue. The second queue is used for database re-indexing i.e. the entire 50Tb if data is to be indexing every couple of months (where everything is added to the queue). I have a lambda function that consumes the messages from the queues and places them into the appropriate queue (either the active index or the indexing being rebuilt).
How should I trigger the lambda function to best process the backlog of messages in SQS so it process both queues as quickly as possible?
A constraint I have is that the queue items need to be processed in order. If the lambda function could be run indefinitely without the 5 minute limit I could keep running one function that constantly processes messages.
Instead of pushing your messages directly into SQS you could publish the messages to a SNS Topic with 2 Subscriber registered.
Subscriber: SQS
Subscriber: Lambda Function
Has the benefit that your Lambda is invoked at the same time as the message is stored in SQS.
The standard way to do this is to use Cloudwatch Events that run periodically. This lets you pull data from the queue on a regular schedule.
Because you have to poll SQS this may not lead to the fastest processing of messages. Also, be careful if you constantly have messages to process - Lambda will end up being far more expensive than a small EC2 instance to handle the messages.
Not sure I fully understand your problem, but here are my 2 cents:
If you have a constant and real-time stream of data, consider using Kinesis Streams with 1 shard in order to preserve the FIFO. You may consume the data in batch of n items using lambda. Up to you to decide the batch size n and the memory size of lambda.
with this solution you pay a low constant price for Kinesis Streams and a variable price for Lambdas.
Should you really are in love with SQS and the real-time does not metter, you may consume items with Lambdas or EC2 or Batch. Either you trigger many lambdas with CloudWatch Events, either you keep alive an EC2, either you trigger on a regular basis an AWS Batch job.
there is an economic equation to explore, each solution is the best for one use case and the worst for another, make your choice ;)
I prefer SQS + Lambdas when there are few items to consume and SQS + Batch when there are a lot of items to consume.
You may probably also consider using SNS + SQS + Lambdas like #maikay says in his answer, but I wouldn't choose that solution.
Hope it helps. Feel free to ask for clarifications. Good luck!
Related
I am trying to find what is the limit of messages that i can pull from sqs using lambda.
I saw some documentation from AWS that we can only recieve 10 messages at a time in the form of batch.
But my requirement is to pull 10-12k messages at a time from the sqs queue using lambda and loop through it.
Not sure this requirement is doable want to know if it is even possible.
Here is the document from Amazon which is similar to your question similar to your question
For example, a Lambda function receives messages from an SQS queue and writes to a DynamoDB table. It has a reserved concurrency of 10 with a batch size of 10 items. The SQS queue rapidly receives 1,000 messages. The Lambda function scales up to 10 concurrent instances, each processing 10 messages from the queue.
That means Lambda helps you to auto scale when you need to process a bunch of messages.
In this document, Amazon says that the concurrent executions can be increased up to tens of thousands, which I think it answers your question but please be noticed that the burst concurrency quotas is different from the regions
This one is the real case example with Ebay API
For my real project, our practice is try to split the logic into smaller ones, which means we split into some smaller Lambdas and SQSs so that we can manage the logic, debug and also help to reduce the traffic.
And one thing you should also monitor is the price! Do not forget to monitor and consider for your solution.
We noticed that when setting up an AWS lambda to trigger from SQS that a lot of times the trigger happens minutes and sometimes up to an hour delay to trigger. I know AWS lambda does polling internally and when the queue is empty it probably does some exponential backoff.
However, we have a scheduler that runs every 30 min and pushes data into the queue. However, lambda is triggered much much later for a % of messages. Our business requirement that it triggers within a min.
Is there a way to force lambda to check the queue consistently? An alternative was to uses step functions but this is not possible due to another answer in this thread --> How do you run functions in parallel?
I was also thinking about pushing data into s3 and have lambda trigger from s3 asynchronously vs being polled but s3 does not have a batch api when we want to push a lot of records so that's out.
It turned out to be the wrong use of async/await when using AWS SDK. They only support .promise(). That was the reason that not all messages ended up in sqs.
Hope it helps others. AWS is working on a new sdk that will support async/await. Here is the link for their
https://github.com/aws/aws-sdk-js-v3/issues/153#issuecomment-457769969
I would check to make sure the SQS is either long pulling or short pulling.
"In almost all cases, Amazon SQS long polling is preferable to short polling. Long-polling requests let your queue consumers receive messages as soon as they arrive in your queue while reducing the number of empty ReceiveMessageResponse instances returned.
Amazon SQS long polling results in higher performance at reduced cost in the majority of use cases. However, if your application expects an immediate response from a ReceiveMessage call, you might not be able to take advantage of long polling without some modifications to your application.
For example, if your application uses a single thread to poll multiple queues, switching from short polling to long polling will probably not work, because the single thread will wait for the long-poll timeout on any empty queues, delaying the processing of any queues that might contain messages.
In such an application, it is a good practice to use a single thread to process only one queue, allowing the application to take advantage of the benefits that Amazon SQS long polling provides."
Also, if you're looking to fire off your lambda function, could you set up an SNS notification system to your SQS? Something along the lines of SQS SNS Lambda. This should get you sub minute and you're not constantly pulling the queue for messages. You'll just do it on a SNS.
https://aws.amazon.com/sqs/faqs/
I have an AWS Lambda function which processes events from S3. I'd like to aggregate them before processing and let lambda process the batch.
This is depicted below:
Ideally, I'd like to be able to specify a batch size, and a timeout (say a single even, and then nothing for 5 sec, I'd like to send an 1-event batch).
Is there an idiomatic way to do it using Lambda or other AWS services?
There are a few things you can do:
1. Make upstream do the aggregation:
Make publishing the publisher's responsibility, and get the publisher to give you one event per group of objects to process. This works well if the publisher is already working in batches.
2. Insert your own aggregation step:
Trigger on each event.
Store the event somewhere.
If enough events have been stored, empty the store and pass all the contents to the processing step.
This works well if your processing step is much more expensive per event than just handling the event. Often, this can take the form of {aggregating lambda} -> {processing batch job}, since Lambda isn't great for very expensive processing.
3. Do aggregation on a time basis:
Send your events to an SQS queue.
Trigger on a timer (e.g. Cloudwatch events).
When triggered, empty the queue and process everything in it. If it's too much to process in a single invocation, immediately trigger an additional lambda.
This works well if processing is fairly cheap, and you want to minimize your number of Lambda invocations. The trigger schedule (how long you wait in between invocations) is determined by weighing how long you're willing to wait to process an event against how many invocations you're willing to pay for. Things to watch out for: 1. if you get no events at all, you will still be invoking your Lambda, and 2. if you get events faster than they can be processed, your queue will grow more and more and your processing will fall further and further behind.
I think you can achieve the batch operation by setting SQS queue as destination for S3 notification. Let's say you want to specify a batch size of 20, all your S3 events are going to SQS. You would create a CloudWatch rule to trigger a Lambda when your SQS have 20 items. Your Lambda would poll SQS for the batch of 20 items and process them.
You can also set SQS triggers but it has a limit of max batch size 10.
I’m working on figuring out the best way to have Lambda run one time tasks at a given time.
The system I’m envisioning will basically have events that will need to be sent out, either as soon as the event is received/created, at a specific time, or as a recurring action. And I’d like to use AWS as much as possible for this, due to the scalable nature.
My original idea was to have a AWS SQS queue for events to send. Then I’d have a DynamoDB table for future events. I’d also have two AWS Lambda functions, one setup to run on a cron job every few minutes to take the events that are scheduled in the next 15 minutes or so from the DynamoDB table and put them into that AWS SQS queue with a Message Timer setup to delay the message from being visible for that given time. The second Lambda function would be setup and have a trigger to be run from that AWS SQS queue. This function would be responsible for actually sending the event out.
From there I could either add the event to the SQS queue (with or without a message timer) if it’s gonna need to be sent out within the next 15 minutes. Or add it to the DynamoDB table if it’s gonna need to be sent out in the future (beyond 15 minutes).
The biggest problem I just figured out is that AWS SQS FIFO queues doesn’t support Message Timers on individual messages. I need a FIFO queue because I need to prevent these events from being sent out multiple times, or triggering my second Lambda function twice.
I've also looked into the AWS Lambda cron jobs, and although you can schedule invocations every say 5 minutes, I don't think this is what I'm looking for because I'm looking more for scheduling a 1 time invocation in the future, and having that be scalable. So I don't think this is what I'm looking for.
Any ideas on how I can achieve this, since it doesn’t look like Amazon SQS Message Timers will work for what I'm trying to do?
Have you considered Step Function? You could create a wait state before invoking the lambda.
I have a Lambda function that’s triggered by a PUT to an S3 bucket.
I want to limit this Lambda function so that it’s only running one instance at a time – I don’t want two instances running concurrently.
I’ve had a look through the Lambda configuration and docs, but I can’t see anything obvious. I can about writing my own locking system, but it would be nice if this was already a solved problem.
How can I limit the number of concurrent invocations of a Lambda?
AWS Lambda now supports concurrency limits on individual functions:
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
I would suggest you to use Kinesis Streams (or alternatively DynamoDB + DynamoDB Streams, which essentially have the same behavior).
You can see Kinesis Streams as as queue. The good part is that you can use a Kinesis Stream as a Trigger to you Lambda function. So anything that gets inserted into this queue will automatically be passed over to your function, in order. So you will be able to process those S3 events one by one, one Lambda execution after the other (one instance at a time).
In order to do that, you'll need to create a Lambda function with the simple purpose of getting S3 Events and putting them into a Kinesis Stream. Then you'll configure that Kinesis Stream as your Lambda Trigger.
When you configure the Kinesis Stream as your Lambda Trigger I suggest you to use the following configuration:
Batch size: 1
This means that your Lambda will be called with only one event from Kinesis. You can select a higher number and you'll get a list of events of that size (for example, if you want to process the last 10 events in one Lambda execution instead of 10 consecutive Lambda executions).
Starting position: Trim horizon
This means it'll behave as a queue (FIFO)
A bit more info on AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AWS Lambda.
I hope this helps anyone with a similar problem.
P.S. Bear in mind that Kinesis Streams have their own pricing. Using DynamoDB + DynamoDB Streams might be cheaper (or even free due to the non-expiring Free Tier of DynamoDB).
No, this is one of the things I'd really like to see Lambda support, but currently it does not. One of the problems is that if there were a lot of S3 PUT operations happening AWS would have to queue up all the Lambda invocations somehow, and there is currently no support for that.
If you built a locking mechanism into your Lambda function, what would you do with the requests you don't process due to a lock? Would you just throw those S3 notifications away?
The solution most people recommend is to have S3 send the notifications to an SQS queue, and then have your Lambda function scheduled to run periodically, like once a minute, and check if there is an item in the queue that needs to be processed.
Alternatively, have S3 send the notifications to SQS and just have a t2.nano EC2 instance with a single-threaded service polling the queue.
I know this is an old thread, but I ran across it trying to figure out how to make sure my time sequenced SQS messages were processed in order coming out of a FIFO queue and not getting processed simultaneously/out-of-order via multiple Lambda threads running.
Per the documentation:
For FIFO queues, Lambda sends messages to your function in the order
that it receives them. When you send a message to a FIFO queue, you
specify a message group ID. Amazon SQS ensures that messages in the
same group are delivered to Lambda in order. Lambda sorts the messages
into groups and sends only one batch at a time for a group. If your
function returns an error, the function attempts all retries on the
affected messages before Lambda receives additional messages from the
same group.
Your function can scale in concurrency to the number of active message
groups.
Link: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
So essentially, as long as you use a FIFO queue and submit your messages that need to stay in sequence with the same MessageGroupID, SQS/Lambda automatically handles the sequencing without any additional settings necessary.
Have the S3 "Put events" cause a message to be placed on the queue (instead of involving a lambda function). The message should contain a reference to the S3 object. Then SCHEDULE a lambda to "SHORT POLL the entire queue".
PS: S3 events can not trigger a Kinesis Stream... only SQS, SMS, Lambda (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-destinations). Kinesis Stream are expensive and used for real-time event handling.