We're at the beginning stages of writing an AWS Lambda function to copy massive numbers of files within S3.
This Lambda will be triggered from S3.
Any advice about the maximum number of triggers Lambda can handle at one time? For example, if we dump 10,000 trigger files in the S3 trigger folder, will Lambda handle this pretty well, or will it throttle itself back enough to slow the whole thing down? Is there a better recommended number?
I have noticed that the triggers to Lambda do not scale well with a high number of concurrently copied objects (it works with SNS). There are other limits that turn this into a different issue: Lambda concurrency limits and Lambda cold starts.
One way we got the triggers to work was to send them to SNS and forward them to Lambda from there. They will be queued and delivered by SNS. You will see some latency from the round trip through SNS.
Hope it helps.
EDIT1:
On the other hand, if you still want to keep the direct trigger and don't mind the extra time, it worked fine with a 500-millisecond pause between each successful copy, so that all the triggers don't fire at once.
This is what we tested first, and it worked, but it took longer. With SNS, no throttling is required.
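A minimal sketch of that pause-between-copies approach, assuming boto3 and hypothetical bucket names; the 500 ms figure is just what worked in our testing:

```python
# Minimal sketch of throttling the copies, assuming boto3 and hypothetical
# bucket names; the 500 ms pause is the figure from the testing described above.
import time
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "source-bucket"       # hypothetical
DEST_BUCKET = "destination-bucket"    # hypothetical

def copy_with_throttle(keys, delay_seconds=0.5):
    """Copy each key, pausing so the destination's triggers don't all fire at once."""
    for key in keys:
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )
        time.sleep(delay_seconds)  # ~500 ms between successful copies
```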
Related
I have a scheduled error-handling Lambda; I would like to use serverless technology here as opposed to a Spring Boot service or something similar.
The Lambda will read from an S3 bucket and process accordingly. The problem is that at times the S3 bucket may have a high volume of data to be processed, and long-running operations aren't suited to Lambdas.
One solution I can think of is to have the Lambda read and process one item from the bucket and, on success, trigger another instance of the same Lambda unless the bucket is empty/fully processed. The thing I don't like is that this is synchronous and quite slow. I also need to be conscious of running too many Lambdas at the same time, as we are hitting a REST endpoint as part of the error flow and don't want to overload it with too many requests.
I am thinking it would be nice to have maybe three instances of the Lambda running at the same time until the bucket is empty, but I'm not really sure. I am wondering if anyone has any nice patterns that could be used here or suggestions on best practices?
Thanks
Create an S3 bucket for processing your files.
Enable an S3 -> Lambda trigger: on every new file in the bucket the Lambda will be invoked to process that file, and every file is processed separately. https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Once the file is processed you could either delete it or move it somewhere else (see the sketch below).
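A minimal sketch of such a handler, where the "processed/" prefix and the process_file() body are placeholders for your own logic:

```python
# Minimal sketch of an S3-triggered handler, assuming the "processed/" prefix
# and process_file() are placeholders for your own processing logic.
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        process_file(bucket, key)

        # Move the object out of the trigger prefix once it has been processed.
        s3.copy_object(
            Bucket=bucket,
            Key=f"processed/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)

def process_file(bucket, key):
    # Placeholder: read and process the object.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```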
Regarding concurrency, please have a look at provisioned concurrency: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Update:
As you still plan to use a scheduler Lambda and S3:
The scheduler Lambda reads/lists only the file names and puts messages into SQS for the files to be processed.
A new Lambda consumes the SQS messages and processes the files.
Note: I would recommend using SQS in the first place if the files/messages are not too big; it has built-in recovery mechanics (DLQ, delays, visibility, etc.) from which you benefit more than from plain S3 storage. The second way is to just create a message with a file reference and still use SQS.
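A rough sketch of the scheduler/consumer pair, where the queue URL, bucket name and per-file processing are placeholders for your own resources:

```python
# Rough sketch of the scheduler/consumer pair, assuming hypothetical
# QUEUE_URL and BUCKET values and that per-file processing happens in the worker.
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/files-to-process"  # hypothetical
BUCKET = "my-processing-bucket"                                                  # hypothetical

def scheduler_handler(event, context):
    """Scheduled Lambda: list file names only and enqueue one SQS message per file."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"bucket": BUCKET, "key": obj["Key"]}),
            )

def worker_handler(event, context):
    """SQS-triggered Lambda: process the file referenced by each message."""
    for record in event["Records"]:
        message = json.loads(record["body"])
        body = s3.get_object(Bucket=message["bucket"], Key=message["key"])["Body"].read()
        # ... process the file contents here ...
```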
I'd separate the Lambda that is called by the scheduler from the Lambda that does the actual processing. When the scheduler calls the first Lambda, it can look at the contents of the bucket, then spawn the worker Lambdas to process the objects. This way you have control over how many objects you want per worker.
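Something along these lines, where the bucket, worker function name and chunk size are all assumptions you would adjust:

```python
# Hedged sketch of the scheduler/worker split; the bucket, worker function
# name and chunk size are hypothetical, not a fixed recommendation.
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BUCKET = "error-handling-bucket"   # hypothetical
WORKER_FUNCTION = "error-worker"   # hypothetical
OBJECTS_PER_WORKER = 50            # controls how many objects each worker handles

def scheduler_handler(event, context):
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # Spawn one worker per chunk; InvocationType="Event" is asynchronous, so
    # this Lambda does not sit (and get billed) waiting for the workers.
    for i in range(0, len(keys), OBJECTS_PER_WORKER):
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",
            Payload=json.dumps({"keys": keys[i:i + OBJECTS_PER_WORKER]}),
        )
```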
Given your requirements, I would recommend:
Configure an Amazon S3 Event so that a message is pushed to an Amazon SQS queue when objects are created in the S3 bucket
Schedule an AWS Lambda function at regular intervals that will:
Check that the external service is working
Invoke a Lambda function to process one message from the queue, and keep looping
The hard part would be throttling the second Lambda function so that it doesn't try to send all requests at once (which might impact that external service).
You could probably do this by using a Step Function to trigger Lambda and then, if it was successful, trigger another Lambda function. This could even be done in parallel, such as allowing up to three parallel Lambda executions. The benefit of using Step Functions is that there is no cost for "waiting" for each Lambda to finish executing.
So, the Step Function flow would be something like:
Invoke a "check external service" Lambda function
If it fails, then quit the flow
Invoke the "processing" Lambda function
Get one message
Process the message
If successful, remove the message from the queue
Return success/fail
If it was successful, keep looping until the queue is empty
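A rough sketch of what that "processing" Lambda could look like, with the queue URL and the external-service call as hypothetical placeholders; the Step Function's Choice state can loop on the returned queueEmpty flag:

```python
# Rough sketch of the "processing" Lambda in the flow above; QUEUE_URL and
# call_external_service() are hypothetical placeholders.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/error-queue"  # hypothetical

def handler(event, context):
    # Get at most one message from the queue.
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    messages = response.get("Messages", [])
    if not messages:
        return {"status": "success", "queueEmpty": True}

    message = messages[0]
    try:
        call_external_service(message["Body"])  # process the message
    except Exception:
        return {"status": "fail", "queueEmpty": False}

    # Only remove the message once it was processed successfully.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
    return {"status": "success", "queueEmpty": False}

def call_external_service(body):
    # Placeholder for the REST call that must not be overloaded.
    pass
```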
We have an external service that uploads files to our S3 bucket in account A. We raise SNS notifications on each upload. A lambda function in account B subscribes to these notifications.
This works well for us, except that if the external service misses a configuration, it uploads >500 files together (in a single directory). And our lambda is triggered 500 times when that happens.
1. Is there a way to limit the number of files uploaded to a bucket within X minutes?
2. Is there a way to stop the lambda from getting invoked if it sees >500 SNS notifications together?
I am aware that placing an SQS between the Lambda and SNS will probably solve our problem. I want to know if there is another easier, more convenient way to solve this.
I explored the possibility of limiting the Lambda concurrency so that it fails on throttling; however, SNS notifications are only retried three times (which is also a good thing, and we don't want to lose this feature in case of other errors), so we do not want to do that.
Note that instant processing is not a hard requirement for us. We can wait for around 5 minutes to process the SNS notification.
No, it is not possible to limit uploads to Amazon S3 within a given time period.
Nor is it possible to stop Lambda being invoked if it sees more than a given quantity of Amazon SNS notifications.
I have an S3 bucket with different files. I need to read those files and publish an SQS message for each row in the file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task, read and publish. But can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date (file name) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
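A minimal sketch of the second (DynamoDB TTL) strategy, where the table name, queue URL and one-message-per-file granularity are assumptions; per-row fan-out would happen in the stream handler or a downstream consumer, and the table's stream must be configured to include the old image:

```python
# Minimal sketch of the DynamoDB-TTL approach; the table name, queue URL and
# 30-day TTL are assumptions, and the stream must be configured with OLD_IMAGE.
import json
import time
import boto3

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

TABLE = "delayed-files"                                                   # hypothetical
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-rows"  # hypothetical

def s3_event_handler(event, context):
    """Triggered by S3 uploads: store a reference that expires in ~30 days."""
    for record in event["Records"]:
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "objectKey": {"S": record["s3"]["object"]["key"]},
                "bucket": {"S": record["s3"]["bucket"]["name"]},
                "ttl": {"N": str(int(time.time()) + 30 * 24 * 3600)},
            },
        )

def stream_handler(event, context):
    """Triggered by the DynamoDB stream: act only on TTL deletions."""
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue
        old = record["dynamodb"]["OldImage"]
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": old["bucket"]["S"], "key": old["objectKey"]["S"]}),
        )
```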
At the core, you need to do two things:
Enumerate all of the objects in a bucket in S3, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted and continue paging through files from a previous run.
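A rough sketch of that resumable paging, assuming the caller (the next scheduled run, or a Step Function) passes back the continuation token this handler returns; the bucket name, cutoff and per-object action are placeholders:

```python
# Rough sketch of resumable paging over a large bucket; BUCKET, the 30-day
# cutoff and act_on_old_object() are hypothetical placeholders.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # hypothetical

def handler(event, context):
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    token = event.get("continuationToken")  # resume from a previous run, if any

    while True:
        kwargs = {"Bucket": BUCKET}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)

        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                act_on_old_object(BUCKET, obj["Key"])  # e.g. post to SQS

        token = page.get("NextContinuationToken")
        if not token:
            return {"done": True}
        # Stop early if this run is about to time out and hand the token back.
        if context.get_remaining_time_in_millis() < 30_000:
            return {"done": False, "continuationToken": token}

def act_on_old_object(bucket, key):
    # Placeholder: publish the object reference to SQS or similar.
    pass
```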
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work on a few dozen files every day.
It all depends on your workload and needs.
I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it has worked fine: when I upload a bunch of text files to the S3 bucket, a bunch of Lambdas run at the same time, each processing a text file.
But my issue is that occasionally 1 or 2 files (out of 20k or so in total) did not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it's NOT that the files were processed by the Lambda but failed; the logs show that the Lambda was not triggered by those 1 or 2 files at all. I don't believe it's hitting the 1,000 concurrent Lambda limit either, since my function runs fast and the peak is around 200 concurrent Lambdas.
My question is: is this because AWS Lambda does not guarantee it will be triggered 100% of the time? As with S3, is there always an (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
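To make the audit suggestion from the docs concrete, here is a minimal sketch that re-drives any object that never made it through processing; the DynamoDB table of processed keys and the reprocess() call are assumptions added on top of the quoted guidance:

```python
# Minimal audit sketch: list the bucket and re-drive any object that was never
# processed. The "processed-keys" table and reprocess() are hypothetical.
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

BUCKET = "incoming-text-files"      # hypothetical
PROCESSED_TABLE = "processed-keys"  # hypothetical: one item per processed key

def audit_handler(event, context):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            item = dynamodb.get_item(
                TableName=PROCESSED_TABLE,
                Key={"objectKey": {"S": obj["Key"]}},
            )
            if "Item" not in item:
                reprocess(BUCKET, obj["Key"])  # the event was presumably lost

def reprocess(bucket, key):
    # Placeholder: invoke the processing Lambda directly or enqueue the key.
    pass
```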
My current AWS Lambda function invokes another AWS Lambda function but I want to make sure that the invoke succeeded. After looking at concurrent execution limits for AWS Lambda I am trying to figure out what would happen if the concurrent limit is hit and I tried to invoke the Lambda from another Lambda.
For now, I am solving this problem by putting messages in an SNS topic, but I would rather invoke the Lambda directly and avoid the indirection.
The best way to handle the concurrent limit is to use a Kinesis stream rather than SNS.
The number of shards will limit the number of Lambdas invoked. And, if it is pertinent for you, you can take several messages at once, which you can't do with SNS, and that is what can lead to hitting the concurrency limit.
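For example, a rough sketch of wiring the stream to the Lambda with a batch size (the stream ARN and function name are placeholders); the stream's shard count then caps how many invocations run in parallel:

```python
# Rough sketch of creating the Kinesis -> Lambda event source mapping; the
# stream ARN, function name and batch size are hypothetical values.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/work-items",  # hypothetical
    FunctionName="worker-function",                                             # hypothetical
    StartingPosition="LATEST",
    BatchSize=100,  # take several records per invocation, unlike SNS
)
```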
Can you elaborate a little? Not sure I understand what you are trying to achieve.
Lambda limits can be viewed in the AWS console on the EC2 page; the top-left corner has a menu item called Limits, and there you should see the limit.
When you hit the limit, the Lambda will stop being invoked, and if my memory serves me right you will see an error in the logs saying something about the limit being hit.
From the AWS Lambda FAQs:
Q: What happens if my account exceeds the default throttle limit on concurrent executions?
On exceeding the throttle limit, AWS Lambda functions being invoked synchronously will return a throttling error (429 error code). Lambda functions being invoked asynchronously can absorb reasonable bursts of traffic for approximately 15-30 minutes, after which incoming events will be rejected as throttled. In case the Lambda function is being invoked in response to Amazon S3 events, events rejected by AWS Lambda may be retained and retried by S3 for 24 hours. Events from Amazon Kinesis streams and Amazon DynamoDB streams are retried until the Lambda function succeeds or the data expires. Amazon Kinesis and Amazon DynamoDB Streams retain data for 24 hours.
Inside the AWS Console you can always create a Service Limit Increase for AWS Lambda concurrent executions at no additional cost. This answer explains this in more detail.
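From the calling Lambda's side, that throttling behaviour looks like the sketch below: a synchronous invoke raises TooManyRequestsException, which you can catch and retry with backoff (the function name is a placeholder):

```python
# Hedged sketch of a synchronous invoke that backs off when throttled (429);
# "downstream-function" is a hypothetical name.
import json
import time
import boto3

lambda_client = boto3.client("lambda")

def invoke_with_retry(payload, attempts=5):
    for attempt in range(attempts):
        try:
            response = lambda_client.invoke(
                FunctionName="downstream-function",  # hypothetical
                InvocationType="RequestResponse",    # synchronous call
                Payload=json.dumps(payload),
            )
            return json.loads(response["Payload"].read())
        except lambda_client.exceptions.TooManyRequestsException:
            time.sleep(2 ** attempt)  # back off before retrying the throttled call
    raise RuntimeError("Invocation still throttled after retries")
```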
I believe you're handling it correctly currently. I was just reading an article that was explaining how you shouldn't invoke lambda from another lambda because:
"If you do, the first will run until the second is finished executing, and you’re double billing yourself. Instead, use SNS or SQS to send a message to the other Lambda."
http://web.archive.org/web/20160713113906/http://www.appliedsoftwaredesign.com/archives/aws-lambda-pro-tips/