Timeout problems with bulk processing of files using AWS Lambda


I have a Lambda function that I expect to exceed 15 minutes of execution time. What should I do so that it keeps running until all of my files have been processed?

If you can, figure out how to scale your workload horizontally. This means splitting your workload so it runs on many Lambdas instead of one "super" Lambda. You don't provide a lot of details, so I'll list a couple of common ways of doing this:
Create an SQS queue and have each Lambda take one item off the queue and process it.
Use an S3 trigger so that when a new file is added to a bucket a lambda processes that file.
If you absolutely need to process for longer than 15 minutes you can look into other serverless technologies like AWS Fargate. Non-serverless options might include AWS Batch or running EC2.
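
A minimal sketch of the S3-trigger option above, assuming a Python runtime. `process_file` is a hypothetical placeholder for the real per-file work, not something from the question. Each invocation only handles the objects named in its own event, so no single invocation has to outlast the 15-minute limit.

```python
from urllib.parse import unquote_plus


def process_file(bucket, key):
    """Hypothetical placeholder for the real per-file work."""
    return f"processed s3://{bucket}/{key}"


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (a space becomes '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: process only the files listed in this event."""
    return [process_file(bucket, key) for bucket, key in extract_objects(event)]
```

S3 invokes the function once per notification, so many uploads simply become many short invocations running side by side.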

15 minutes is the maximum execution time available for AWS Lambda functions.
If your processing is taking more than that, then you should break it into more than one lambda. You can trigger them in sequence or in parallel depending on your execution logic.

Related

Is it possible to achieve parallel processing in AWS Lambda

I have Python code in AWS Lambda that is triggered by an SQS event.
When a new file arrives in a particular S3 location, an SQS message is created, which in turn invokes the Lambda.
Right now, the Lambda processes the files one after the other, serially. I would like to know whether multiple files can be processed at the same time.
Example: if 5 files arrive in the S3 location, all 5 files should be processed in parallel.

I think you might have misobserved the behavior of your system. If you are using the native SQS standard queue integration with Lambda, the Lambdas will consume the queue in batches. You can see a detailed explanation here:
https://aws.amazon.com/blogs/compute/understanding-how-aws-lambda-scales-when-subscribed-to-amazon-sqs-queues/

No need to add SQS.
Trigger the Lambda directly from the S3 PutObject event. This gives you one invocation per object, and the invocations run in parallel.

Also, check your reserved concurrency value.
Doc: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Check the concurrency dashboard in the monitoring subsection; it is already running in parallel mode.
File may look
Hope this helps!!!

batch processing s3 objects using lambda

The use case is that thousands of very small files are uploaded to S3 every minute, and all the incoming objects are to be processed and stored in a separate bucket using Lambda.
But using s3-object-create as a trigger will cause many Lambda invocations, and concurrency needs to be taken care of. I am trying to batch process the newly created objects every 5-10 minutes. S3 provides batch operations, but its reports are generated only daily or weekly. Is there a service available that can help me?

According to the AWS documentation, S3 can publish "New object created" events to the following destinations:
Amazon SNS
Amazon SQS
AWS Lambda
In your case I would:
Create an SQS queue.
Configure the S3 bucket to publish new object events to the SQS queue.
Reconfigure your existing Lambda to subscribe to the SQS queue.
Configure batching for the incoming SQS events.
Currently, the maximum batch size for an SQS-Lambda subscription is 10,000 messages for standard queues (10 for FIFO queues), and a batch size above 10 also requires a batching window. But since your Lambda needs around 2 seconds to process a single event, you should start with something much smaller; otherwise the Lambda will time out before it can process all of the events.
Thanks to this, uploading X items to S3 will produce roughly X / Y Lambda invocations, where Y is the configured batch size. For 1000 S3 items and a batch size of 100, that is only around 10 Lambda invocations.
The AWS documentation mentioned above explains how to publish S3 events to SQS. I won't explain it here, as it's more about implementation details.
Execution time
However, you might run into a problem where processing is too slow, because the Lambda will probably process the events one by one in a loop.
The workaround is to process them asynchronously. The implementation depends on which runtime you use for Lambda; in Node.js it is very easy to achieve.
If you want to speed up processing in other ways, reduce the maximum batch size and increase the Lambda memory configuration, so that a single execution processes fewer events and has access to more CPU.
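
For a Python runtime, a rough sketch of the batched handler could look like this. It assumes each SQS message body is an S3 event notification, and that the event source mapping has `ReportBatchItemFailures` enabled so that only the failed messages are retried. `process_object` and the pool size of 16 are hypothetical placeholders.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote_plus


def process_object(bucket, key):
    """Hypothetical placeholder: read the object, transform it, store the result."""
    if not key:
        raise ValueError("empty object key")


def handle_message(message):
    """Process every S3 object referenced by one SQS message."""
    body = json.loads(message["body"])
    # Messages without "Records" are skipped instead of raising.
    for record in body.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        process_object(bucket, key)


def handler(event, context):
    """Process the whole SQS batch concurrently and report failed messages."""
    messages = event["Records"]
    # Leaving the 'with' block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = [(m["messageId"], pool.submit(handle_message, m)) for m in messages]
    failures = [
        {"itemIdentifier": message_id}
        for message_id, future in futures
        if future.exception() is not None
    ]
    return {"batchItemFailures": failures}
```

Threads suit this case because the work is mostly waiting on S3; for CPU-heavy processing, a smaller batch size and more memory is the simpler lever.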

Maximizing number of parallel operation in AWS Lambda

I have an AWS Lambda which has to invoke an API endpoint for 2 million records. Considering that the maximum execution period of Lambda is 15 minutes. I have to somehow process all these records using one Lambda(that is in 15 minutes if possible). The API endpoint which I want to invoke can handle 3000 TPS. I want to maximize/parallelize my calls so I can utilize the TPS provided and run the operations using a single Lambda. I have created my invocations within a parallelStream in Java. Is it possible to do it using the current approach? If yes, what changes would I have to make in the Lambda runtime in order to use multiple cores?

Considering that the maximum execution period of Lambda is 15 minutes.
I have to somehow process all these records using one Lambda(that is
in 15 minutes if possible).
Why? This defeats the entire reason you would use AWS Lambda for this task. Why limit yourself to a single Lambda function invocation to do all this work?
If you wrote a script to take your 2 million records and add them to an SQS queue, then you could have the AWS Lambda service automatically feed these records into multiple, parallel instances of your AWS Lambda function. This would allow you to easily tune the number of Lambda functions you want to have running in parallel, and also automatically handle retries in the case of failures.
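
For scale: at 3000 TPS, 2 million calls need at least 2,000,000 / 3000 ≈ 667 seconds (about 11 minutes), so a single invocation would have to sustain the full rate with almost no headroom. A minimal sketch of the loader script follows, assuming Python and a boto3-style SQS client passed in by the caller. `enqueue_records` and `chunked` are hypothetical helper names; `send_message_batch` accepts at most 10 entries per call.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enqueue_records(sqs_client, queue_url, records):
    """Send records to SQS, 10 per SendMessageBatch call.

    Returns the number of entries that SQS reported as failed.
    """
    failed = 0
    for batch_no, batch in enumerate(chunked(records, 10)):
        entries = [
            # Ids only need to be unique within a single batch request.
            {"Id": f"{batch_no}-{i}", "MessageBody": json.dumps(record)}
            for i, record in enumerate(batch)
        ]
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed += len(response.get("Failed", []))
    return failed
```

It would be called as `enqueue_records(boto3.client("sqs"), queue_url, records)`. The reserved concurrency of the consuming function is then the knob that keeps the combined call rate under the API's limit.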

Serverless Task Scheduling on AWS

So our project was using Hangfire to dynamically schedule tasks but keeping in mind auto scaling of server instances we decided to do away with it. I was looking for cloud native serverless solution and decided to use CloudWatch Events with Lambda. I discovered later on that there is an upper limit on the number of Rules that can be created (100 per account) and that wouldn't scale automatically. So now I'm stuck and any suggestions would be great!

As per CloudWatch Events documentation you can request a limit increase.
100 per region per account. You can request a limit increase. For
instructions, see AWS Service Limits.
Before requesting a limit increase, examine your rules. You may have
multiple rules each matching to very specific events. Consider
broadening their scope by using fewer identifiers in your Event
Patterns in CloudWatch Events. In addition, a rule can invoke several
targets each time it matches an event. Consider adding more targets to
your rules.
If you're trying to create a serverless task scheduler one possible way could be:
CloudWatch Event that triggers a lambda function every minute.
Lambda function reads a DynamoDB table and decides which actions need to be executed at that time.
Lambda function could dispatch the execution to other functions or services.
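
The selection step in that scheduler could be sketched like this in Python. The item attributes `task_id` and `run_at` (an ISO 8601 timestamp with a UTC offset) are assumptions about the table layout, and `dispatch` stands in for whatever hands the task to another function or service.

```python
from datetime import datetime, timezone


def due_tasks(tasks, now):
    """Return the tasks whose scheduled time has passed, oldest first."""
    due = [t for t in tasks if datetime.fromisoformat(t["run_at"]) <= now]
    return sorted(due, key=lambda t: datetime.fromisoformat(t["run_at"]))


def run_scheduler(tasks, now, dispatch):
    """Dispatch every due task and return the ids that were dispatched."""
    dispatched = []
    for task in due_tasks(tasks, now):
        dispatch(task)
        dispatched.append(task["task_id"])
    return dispatched


def current_time():
    """Timezone-aware 'now', to compare against the stored timestamps."""
    return datetime.now(timezone.utc)
```

Dispatched tasks still need to be marked as done (or deleted) in the table, otherwise the next minute's run will pick them up again.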

So I decided to do as Diego suggested, use CloudWatch Events to trigger a Lambda every minute which would query DynamoDB to check for the tasks that need to be executed.
I had some concerns regarding the data that would be fetched from DynamoDB (duplicate items if an execution takes longer than 1 minute), so I decided to set the concurrency to 1 for that Lambda.
I also had some concerns about executing those tasks directly from that Lambda itself (timeouts, and tasks at the end of a long list), so what I'm doing is pushing each task to SQS separately, and another Lambda is triggered by SQS to execute those tasks in parallel. So far the results look good; I'll keep updating this thread if anything comes up.

Delay Lambda execution over specific data

I am trying to come up with a way to have pieces of data processed at specific time intervals by invoking aws lambda every N hours.
For example, parse a page at specific url every 6 hours and store result in s3 bucket.
Have many (~100k) urls each processed that way.
Of course, you can have a VM that hosts some scheduler that would trigger lambdas, as described in this answer, but that breaks the "serverless" approach.
So, is there a way to do this using aws services only?
Things I tried that do not work:
SQS can delay messages, but only for a maximum of 15 min (I need hours), and there is no built-in integration between SQS and Lambda, so you need some polling agent (a Lambda?) that polls the queue all the time and sends new messages to a worker Lambda, which again defeats the point of only executing at the scheduled time;
CloudWatch Alarms can send messages to SNS, which triggers Lambda. You can implement periodic Lambda calls that way by using a future metric timestamp, but an alarm message cannot carry custom data (think of the url from the example above), so that does not work either;
I could create Lambda CloudWatch scheduled triggers programmatically but they also cannot pass any data to Lambda.
The only way I could think of is to have a DynamoDB table with "url" records, each with the timestamp of its last "processing", and a periodic Lambda that queries the table and sends "old" records as jobs to another "worker" Lambda (directly or via SNS).
That would work, however you still need to have a "polling" lambda, which could become a bottleneck as number of items to process grows.
Any other ideas?

100k jobs every 6 hours doesn't sound like a great use case for serverless, IMO. Personally, I would set up a CloudWatch event with a relevant cron expression that triggers a Lambda to start an EC2 instance, which processes all the URLs (stored in DynamoDB), and script the EC2 instance to shut down after processing the last url.
But that's not what you asked.
You could set up a CloudWatch event with a relevant cron expression that spawns a Lambda (orchestrator), which reads the urls from DynamoDB or even an S3 file and then invokes a second Lambda (worker) for each url to actually parse the pages.
Using this pattern you will start hitting concurrency issues at 1000 lambdas (1 orchestrator & 999 workers), less if you have other lambdas running in the same region. You can ask AWS to increase this limit, but I don't know under what scenarios they will do this, or how high they will increase the limit.
From here you have three choices:
1. Split out the payload to each worker Lambda so each instance receives multiple urls to process.
2. Add another column to your list of urls and group urls with this column (e.g. the first 500 are marked with a 1, the second 500 are marked with a 2, etc). Then your orchestrator Lambda could take urls off the list in batches. This would require you to run the CloudWatch event at a greater frequency and manage the state so the orchestrator Lambda knows, when invoked, which batch is next (I've done this at a smaller scale just storing a variable in an S3 file).
3. Use some combination of options 1 and 2.
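
To make the first choice concrete, here is a small Python sketch of how the orchestrator could size the payloads so that the number of workers stays under a concurrency ceiling. The function names are hypothetical, and the 999-worker figure comes from the 1000-Lambda limit mentioned above.

```python
import math


def urls_per_worker(total_urls, max_workers):
    """Smallest payload size that keeps the worker count within max_workers."""
    return max(1, math.ceil(total_urls / max_workers))


def split_urls(urls, max_workers):
    """Split urls into at most max_workers payloads of near-equal size."""
    size = urls_per_worker(len(urls), max_workers)
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

For 100,000 urls and 999 workers this gives 101 urls per worker and 991 payloads, so each worker has to get through about a hundred pages within its timeout.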

This looks like a fit for a Batch processing scenario, with an AWS Lambda function as the job. It is still serverless, but it obviously adds a dependency on another AWS service.
At the same time, you get a dashboard, processing status, retries, and all the other perks of a job scheduling service.
