Batch processing S3 objects using Lambda

The use case: thousands of very small files are uploaded to S3 every minute, and every incoming object must be processed and stored in a separate bucket using Lambda.
But using s3:ObjectCreated as a trigger produces one Lambda invocation per object, so concurrency has to be managed. I am trying to batch-process the newly created objects every 5-10 minutes. S3 provides Batch Operations, but its reports are only generated daily or weekly. Is there a service that can help me?

According to the AWS documentation, S3 can publish "new object created" events to the following destinations:
Amazon SNS
Amazon SQS
AWS Lambda
In your case I would:
Create an SQS queue.
Configure the S3 bucket to publish new-object events to the queue.
Reconfigure your existing Lambda to consume from the queue instead.
Configure batching for the incoming SQS events (a setup sketch follows this list).
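Here is a minimal setup sketch using the AWS SDK for JavaScript v3. The bucket name, queue ARN, and function name are placeholders, not values from the question:

import { S3Client, PutBucketNotificationConfigurationCommand } from "@aws-sdk/client-s3";
import { LambdaClient, CreateEventSourceMappingCommand } from "@aws-sdk/client-lambda";

const s3 = new S3Client({});
const lambda = new LambdaClient({});

// Step 2: publish new-object events from the bucket to the queue.
await s3.send(new PutBucketNotificationConfigurationCommand({
  Bucket: "my-upload-bucket",
  NotificationConfiguration: {
    QueueConfigurations: [{
      QueueArn: "arn:aws:sqs:us-east-1:123456789012:incoming-objects",
      Events: ["s3:ObjectCreated:*"],
    }],
  },
}));

// Steps 3-4: subscribe the Lambda to the queue with batching enabled.
await lambda.send(new CreateEventSourceMappingCommand({
  EventSourceArn: "arn:aws:sqs:us-east-1:123456789012:incoming-objects",
  FunctionName: "process-objects",
  BatchSize: 100,                      // start small and tune upward
  MaximumBatchingWindowInSeconds: 300, // wait up to 5 minutes to fill a batch
}));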
Currently, the maximum batch size for the SQS-Lambda subscription is 10,000 records (when a batching window is configured). But since your Lambda needs around 2 seconds to process a single event, you should start with something much smaller, otherwise the Lambda will time out before it can process the whole batch.
Thanks to this, uploading X objects to S3 will produce roughly X / Y Lambda invocations, where Y is the batch size. For 1,000 S3 objects and a batch size of 100, that is only about 10 concurrent Lambda executions.
The AWS documentation mentioned above explains how to publish S3 events to SQS, so I won't repeat the implementation details here.
Execution time
However, you might run into a problem where the processing is too slow, because the Lambda will probably process events one by one in a loop.
The workaround is asynchronous processing. The implementation depends on which runtime you use for Lambda; with Node.js it is very easy to achieve, as in the sketch below.
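For example, a minimal Node.js (TypeScript) handler sketch that processes all records of one SQS batch concurrently with Promise.all; processObject and the destination bucket are hypothetical stand-ins for your own logic:

import { SQSEvent } from "aws-lambda";
import { S3Client, CopyObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Hypothetical per-object work: copy the object into the destination bucket.
async function processObject(bucket: string, key: string): Promise<void> {
  await s3.send(new CopyObjectCommand({
    Bucket: "my-destination-bucket",    // placeholder
    CopySource: `${bucket}/${key}`,     // keys with special characters may need URL-encoding
    Key: key,
  }));
}

export const handler = async (event: SQSEvent): Promise<void> => {
  // Each SQS record body wraps one S3 event notification.
  await Promise.all(event.Records.map(async (record) => {
    const s3Event = JSON.parse(record.body);
    for (const rec of s3Event.Records ?? []) {
      const key = decodeURIComponent(rec.s3.object.key.replace(/\+/g, " "));
      await processObject(rec.s3.bucket.name, key);
    }
  }));
};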
If you want to speed up the processing in other ways, reduce the maximum batch size and increase the Lambda memory configuration, so a single execution processes fewer events and has access to more CPU.

Related

AWS Lambda read from SQS without concurrency

My requirement is like this.
Read from an SQS queue every 2 hours, take all the available messages, and then process them.
Processing includes creating a file with details from the SQS messages and sending it to an SFTP server.
I implemented an AWS Lambda to achieve point 1. The Lambda has an SQS trigger. I set the batch size to 50 and the batch window to 2 hours. My assumption was that the Lambda would be triggered every 2 hours, 50 messages would be delivered to the function in one go, and I would create a file for every 50 records.
But I observed that my Lambda function is triggered with a varying number of messages (sometimes 50, sometimes 20, sometimes 5, etc.) even though I configured the batch size as 50.
After reading some documentation I gathered (I am not sure) that Lambda spawns 5 long-polling connections to read from SQS, and this is what causes the function to be triggered with a varying number of messages.
My question is
Is my assumption about 5 parallel connections correct? If yes, is there a way to control it? I want this to happen over a single thread/connection.
If 1 is not possible, what alternatives do I have? I do not want one file created for every few records. I want one file generated every two hours with all the messages in SQS.
A "SQS Trigger" for Lambda is implemented with the so-called Event Source Mapping integration, which polls, batches and deletes messages from the queue on your behalf. It's designed for continuous polling, although you can disable it. You can set a maximum batch size of up to 10,000 records a function receives (BatchSize) and a maximum of 300s long polling time (MaximumBatchingWindowInSeconds). That doesn't meet your once-every-two-hours requirement.
Two alternatives:
Remove the Event Source Mapping. Instead, trigger the Lambda every two hours on a schedule with an EventBridge rule. Your Lambda is then responsible for the SQS ReceiveMessage and DeleteMessageBatch operations (a sketch follows after this list). This approach ensures your Lambda is invoked only once per cron event.
Keep the Event Source Mapping. Process messages as they arrive, accumulating the partial results in S3. Once every two hours, run a second, EventBridge-triggered Lambda, which bundles the partial results from S3 and sends them to the SFTP server. You don't control the number of Lambda invocations.
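A sketch of the first alternative, assuming the queue URL arrives via an environment variable and that the backlog fits within the 15-minute Lambda timeout:

import { SQSClient, ReceiveMessageCommand, DeleteMessageBatchCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // hypothetical env var

// EventBridge invokes this once every two hours; the function drains the queue itself.
export const handler = async (): Promise<void> => {
  const bodies: string[] = [];
  while (true) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10, // SQS hard limit per ReceiveMessage call
      WaitTimeSeconds: 1,
    }));
    if (!Messages || Messages.length === 0) break; // queue drained

    bodies.push(...Messages.map((m) => m.Body ?? ""));
    await sqs.send(new DeleteMessageBatchCommand({
      QueueUrl: QUEUE_URL,
      Entries: Messages.map((m) => ({ Id: m.MessageId!, ReceiptHandle: m.ReceiptHandle! })),
    }));
  }
  // Build the single file from `bodies` and send it to the SFTP server here.
  console.log(`Collected ${bodies.length} messages for this two-hour window`);
};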
Note on scaling:
<Edit (mid-Jan 2023): AWS Lambda now supports SQS Maximum Concurrency>
AWS Lambda now supports setting Maximum Concurrency to the Amazon SQS event source, a more direct and less fiddly way to control concurrency than with reserved concurrency. The Maximum Concurrency setting limits the number of concurrent instances of the function that an Amazon SQS event source can invoke. The valid range is 2-1000 concurrent instances.
The create and update Event Source Mapping APIs now have a ScalingConfig option for SQS:
aws lambda update-event-source-mapping \
--uuid "a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" \
--scaling-config '{"MaximumConcurrency":2}' # valid range is 2-1000
</Edit>
With the SQS Event Source Mapping integration you can tweak the batch settings, but ultimately the Lambda service is in charge of Lambda scaling. As the AWS Blog Understanding how AWS Lambda scales with Amazon SQS standard queues says:
Lambda consumes messages in batches, starting at five concurrent batches with five functions at a time. If there are more messages in the queue, Lambda adds up to 60 functions per minute, up to 1,000 functions, to consume those messages.
You could theoretically restrict the number of concurrent Lambda executions with reserved concurrency, but you would risk dropped messages due to throttling errors.
You could try setting the function's reserved concurrency (ReservedConcurrentExecutions) to 1. That may help; see the docs for reference.
A simple solution would be to create a CloudWatch Events rule (similar to a cron job) that triggers your Lambda function every two hours. In the Lambda function, you call ReceiveMessage on the queue until you have all the messages, process them, and afterward delete them from the queue. The drawback is that there may be too many messages to process within the 15-minute Lambda timeout, so that's something you'd have to manage.

Periodic Read from AWS S3 and publish to SQS

I have an S3 bucket with different files. I need to read those files and publish SQS msg for each row in the file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task, read and publish. But can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date (filename) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB stream, and when there's a delete event, it publishes the message to SQS; a sketch of that stream consumer follows below. (A good introduction to this general strategy can be found here.)
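The sketch assumes the stream is configured with OLD_IMAGE (or NEW_AND_OLD_IMAGES) and that each item stores its message in a hypothetical payload attribute:

import { DynamoDBStreamEvent } from "aws-lambda";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // placeholder

// Triggered by the table's DynamoDB stream. TTL expirations arrive as REMOVE
// records, i.e. when the one-month delay has elapsed.
export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    if (record.eventName !== "REMOVE") continue; // only deletes (incl. TTL)
    const payload = record.dynamodb?.OldImage?.payload?.S; // hypothetical attribute
    if (payload) {
      await sqs.send(new SendMessageCommand({ QueueUrl: QUEUE_URL, MessageBody: payload }));
    }
  }
};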
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
At the core, you need to do two things: enumerate all of the objects in an S3 bucket, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS, as in the sketch below.
Should you? No clue. A lot depends on your scale, and what you plan to do if the work takes a long time. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to handle being interrupted and continue paging through files from a previous run.
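A sketch of that daily sweep, using the SDK's ListObjectsV2 paginator; the bucket name, queue URL, and 30-day cutoff are placeholders:

import { S3Client, paginateListObjectsV2 } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";

const s3 = new S3Client({});
const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // placeholder

export const handler = async (): Promise<void> => {
  const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000); // ~one month ago
  let entries: { Id: string; MessageBody: string }[] = [];

  // Page through the bucket; each page holds up to 1,000 objects.
  for await (const page of paginateListObjectsV2({ client: s3 }, { Bucket: "my-bucket" })) {
    for (const obj of page.Contents ?? []) {
      if (obj.Key && obj.LastModified && obj.LastModified < cutoff) {
        entries.push({ Id: String(entries.length), MessageBody: obj.Key });
        if (entries.length === 10) { // SQS SendMessageBatch limit
          await sqs.send(new SendMessageBatchCommand({ QueueUrl: QUEUE_URL, Entries: entries }));
          entries = [];
        }
      }
    }
  }
  if (entries.length > 0) {
    await sqs.send(new SendMessageBatchCommand({ QueueUrl: QUEUE_URL, Entries: entries }));
  }
};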
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work to a few dozen files every day.
It all depends on your workload and needs.

How to determine how many times my Lambda executed over a certain period of time

I have one Lambda that is executed on an S3 put-object trigger.
Now, whenever any object is uploaded to S3, the Lambda is triggered.
Say someone uploads 5 files to S3; the Lambda will then execute once for each of the 5 files.
Is there any way the Lambda can be triggered only once for all those 5 files?
And after those 5 triggers/executions complete, can I trace how many minutes the Lambda has not been executing because no files were uploaded?
Any help would be really helpful.
If you have the bucket notification configured for object creation (s3:ObjectCreated) and you haven't specified a filter (or the filter matches the uploaded objects), your Lambda will be triggered once per uploaded object.
To see the number of invocations, you can look at the Invocations metric for your Lambda function in CloudWatch Metrics; a sketch of querying it programmatically follows.
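For example, summing the metric per hour for the past 24 hours; the function name upload-handler is a placeholder:

import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Sum of invocations per hour over the last 24 hours. Hours with no
// datapoint are hours in which the Lambda did not run at all.
const res = await cw.send(new GetMetricStatisticsCommand({
  Namespace: "AWS/Lambda",
  MetricName: "Invocations",
  Dimensions: [{ Name: "FunctionName", Value: "upload-handler" }], // placeholder
  StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  EndTime: new Date(),
  Period: 3600, // one datapoint per hour
  Statistics: ["Sum"],
}));

for (const dp of res.Datapoints ?? []) {
  console.log(dp.Timestamp, dp.Sum);
}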
You may want to add a queue that will handle the requests to process new files on S3.
A relevant one could be a Kinesis data stream or SQS. If batching is important to you, Kinesis will probably be better.
The requests can be sent by a Lambda triggered by S3 as you described, but that Lambda will only send the request to the queue, and another Lambda will then process it. A simpler way, if possible, is to send the request in the same code that puts the object in S3 (a sketch follows below).
This way you can have statistics of how many requests were sent, processed, waiting, etc.
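A sketch of the simpler variant, where the client code uploads the object and enqueues the processing request itself; all names are placeholders:

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const s3 = new S3Client({});
const sqs = new SQSClient({});

// Upload the file, then immediately enqueue a request to process it, so
// every processing request is counted and traceable in the queue's metrics.
export async function uploadAndEnqueue(bucket: string, key: string, body: string): Promise<void> {
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.QUEUE_URL!, // placeholder
    MessageBody: JSON.stringify({ bucket, key }),
  }));
}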

Latency between putting files in S3 and the trigger taking effect

I have this flow on AWS:
Put file on S3 -> trigger -> Lambda function that inserts an item into DynamoDB -> see that I actually got the new item in DynamoDB.
When I upload a few files (about 5-10) to S3, which triggers the Lambda calls, it takes time to see the expected results in my DynamoDB table.
It seems like there is a queue being handled behind the scenes of the S3 trigger, because when I upload a few more files, the ones that hadn't appeared before now show up as items in DynamoDB.
My expected result is to see a new item in DynamoDB for each file uploaded to S3 within a second of the upload.
Is there a way to handle this with configuration?
I think the above scenario is related to "Concurrent Executions" in Lambda, since you are uploading 5-10 files at once.
Every Lambda function is allocated with a fixed amount of specific resources regardless of the memory allocation, and each function is allocated with a fixed amount of code storage per function and per account.
AWS Lambda account limit per region: 100 concurrent executions by default. See the Limits documentation under Concurrent Executions, which covers event-based sources such as S3.
You can use the following formula to estimate your concurrent Lambda function invocations:
events (or requests) per second * function duration
For example, consider a Lambda function that processes Amazon S3 events. Suppose that the Lambda function takes on average three seconds and Amazon S3 publishes 10 events per second. Then, you will have 30 concurrent executions of your Lambda function.
To increase the limit, refer to the "To request a limit increase for concurrent executions" section in the link above.
AWS may automatically raise the concurrent execution limit on your behalf to enable your function to match the incoming event rate, as in the case of triggering the function from an Amazon S3 bucket.

Controlling Lambda + Kinesis Costs

We have a .NET client application that uploads files to S3. There is an event notification registered on the bucket which triggers a Lambda to process the file. If we need to do maintenance, then we suspend our processing by removing the event notification and adding it back later when we're ready to resume processing.
To process the backlog of files that have queued up in S3 during the period the event notification was disabled, we write a record to a kinesis stream with the S3 key to each file, and we have an event mapping that lets Lambda consume each kinesis record. This works great for us because it allows us to control our concurrency when we are processing a large backlog by controlling the number of shards in the stream. We were originally using SNS but when we had thousands of files that needed to be reprocessed SNS would keep starting Lambdas until we hit our concurrent executions threshold, which is why we switched to Kinesis.
The problem we're facing right now is that the cost of kinesis is killing us, even though we barely use it. We get 150 - 200 files uploaded per minute, and our lambda takes about 15 seconds to process each one. If we suspend processing for a few hours we end up with thousands of files to process. We could easily reprocess them with a 128 shard stream, however that would cost us $1,400 / month. The current cost for running our Lambda each month is less than $300. It seems terrible that we have to increase our COGS by 400% just to be able to control our concurrency level during a recovery scenario.
I could attempt to keep the stream size small by default and then resize it on the fly before we re-process a large backlog, however resizing a stream from 1 shard up to 128 takes an incredibly long time. If we're trying to recover from an unplanned outage then we can't afford to sit around waiting for the stream to resize before we can use it. So my questions are:
Can anyone recommend an alternative pattern to using kinesis shards for being able to control the upper bound on the number of concurrent lambdas draining a queue?
Is there something I am missing which would allow us to use Kinesis more cost efficiently?
You can use SQS with Lambda or Worker EC2s.
Here is how it can be achieved (2 approaches):
1. Serverless Approach
S3 -> SNS -> SQS -> Lambda Scheduler -> Lambda
Use SQS instead of Kinesis for storing S3 Paths
Use a Lambda Scheduler to keep polling messages (S3 paths) from SQS
Invoke the processing Lambda function from the Lambda scheduler (a sketch follows this list).
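A sketch of the Lambda scheduler, which polls SQS and invokes the worker asynchronously; the worker name and per-run cap are placeholders:

import { SQSClient, ReceiveMessageCommand, DeleteMessageBatchCommand } from "@aws-sdk/client-sqs";
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const sqs = new SQSClient({});
const lambda = new LambdaClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // placeholder
const MAX_PER_RUN = 100;                  // caps the fan-out of one scheduler run

export const handler = async (): Promise<void> => {
  let dispatched = 0;
  while (dispatched < MAX_PER_RUN) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
    }));
    if (!Messages || Messages.length === 0) break; // backlog drained

    // Fire-and-forget invocation of the worker for each S3 path.
    for (const m of Messages) {
      await lambda.send(new InvokeCommand({
        FunctionName: "file-processor", // placeholder worker
        InvocationType: "Event",        // asynchronous invoke
        Payload: Buffer.from(JSON.stringify({ s3Path: m.Body })),
      }));
    }
    await sqs.send(new DeleteMessageBatchCommand({
      QueueUrl: QUEUE_URL,
      Entries: Messages.map((m) => ({ Id: m.MessageId!, ReceiptHandle: m.ReceiptHandle! })),
    }));
    dispatched += Messages.length;
  }
};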
2. EC2 Approach
S3 -> SNS -> SQS -> Beanstalk Worker
Use SQS instead of Kinesis for storing S3 Paths
Use Beanstalk Worker environment which polls SQS automatically
Implement the application (processing logic) in the Beanstalk worker, hosted on a local HTTP server in the same EC2 instance.