I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it works fine: when I upload a bunch of text files to the bucket, a bunch of Lambda invocations run in parallel and each one processes a text file.
But my issue is that occasionally, 1 or 2 files (out of 20k or so in total) do not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it is NOT that the file was processed by the Lambda and failed; the logs show that the Lambda was not triggered by those 1 or 2 files at all. I don't believe I'm hitting the 1,000 concurrent Lambda limit either, since my function runs quickly and the peak is around 200 concurrent executions.
My question is: is this because AWS Lambda does not guarantee that the trigger fires 100% of the time? Like S3 itself, is there always a (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
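In practice that means doing your own reconciliation. Something like the following boto3 sketch can audit for missed objects and re-drive them; the bucket name, function name, and the already_processed() check (against whatever record your pipeline keeps of handled files) are all placeholders:

```python
import json

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BUCKET = "my-upload-bucket"          # placeholder
FUNCTION = "my-text-file-processor"  # placeholder


def already_processed(key):
    """Placeholder: check your own record of processed files,
    e.g. a DynamoDB table or database your Lambda writes to."""
    raise NotImplementedError


def audit_and_redrive():
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if already_processed(key):
                continue
            # Re-drive the missed object by invoking the function asynchronously
            # with a minimal S3-style event (only the fields the handler reads).
            event = {"Records": [{"s3": {
                "bucket": {"name": BUCKET},
                "object": {"key": key},
            }}]}
            lambda_client.invoke(
                FunctionName=FUNCTION,
                InvocationType="Event",
                Payload=json.dumps(event).encode(),
            )
```

As the quoted docs note, LIST results can be eventually consistent, so run the audit with some lag behind the uploads.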
Related
I have a project with the following workflow:
Pull files from server and upload to S3
When files hit S3, a message is sent to a topic using SNS
The lambda function subscribed to said topic will then process files by doing calculations
So far, I have not experienced any issues, but I wonder if this is a use case for SQS?
How many messages can my Lambda function handle all at once? If I am pulling hundreds of files, is SQS a necessity?
By default, the parallel invocation limit is set to 1,000.
You can raise that limit, but I have never hit that number so far.
As soon as a Lambda finishes with its current request it picks up another, so if you upload 1,000 files you will probably only need about 100 concurrent Lambdas, unless a single invocation takes minutes to run.
AWS handles the queued triggers for you, so even if you upload 100,000 files they will be consumed as quickly as possible, depending on various factors.
You can test it by creating many little files and uploading them all at once :)
For higher speed, upload them to a different bucket first and then simply move them from bucket to bucket (the server-side copy is faster than uploading again).
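If you want to run that experiment, a quick boto3 sketch along these lines works; the bucket name and file count are just examples:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-test-bucket"  # example bucket with the Lambda trigger configured


def upload_one(i):
    # Each object is a tiny text file; every PUT fires one notification.
    body = f"test file {i}\n".encode()
    s3.put_object(Bucket=BUCKET, Key=f"load-test/file-{i:05d}.txt", Body=body)


# Upload 1,000 small files as close to "all at once" as a single client allows.
with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(upload_one, range(1000)))
```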
Good luck!
We have an external service that uploads files to our S3 bucket in account A. We raise SNS notifications on each upload. A lambda function in account B subscribes to these notifications.
This works well for us, except that when the external service misses a configuration, it uploads more than 500 files together (in a single directory), and our Lambda is triggered 500 times when that happens.
1. Is there a way to limit the number of files uploaded to a bucket within X minutes?
2. Is there a way to stop the lambda from getting invoked if it sees >500 SNS notifications together?
I am aware that placing an SQS queue between the SNS topic and the Lambda will probably solve our problem. I want to know if there is another, easier and more convenient way to solve this.
I explored the possibility of limiting the Lambda's concurrency so that it fails with throttling errors, but SNS notifications are retried three times (which is otherwise a good thing, and we don't want to lose that behaviour for other kinds of errors), so we do not want to do that.
Note that instant processing is not a hard requirement for us. We can wait for around 5 minutes to process the SNS notification.
No, it is not possible to limit uploads to Amazon S3 within a given time period.
Nor is it possible to stop Lambda being invoked if it sees more than a given quantity of Amazon SNS notifications.
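That said, the SQS queue you already mentioned between SNS and the Lambda really is the usual way to absorb such bursts, and with an event source mapping you can also cap how many concurrent invocations drain the queue. A rough boto3 sketch, where all ARNs and names are placeholders and the queue policy that allows SNS to send messages is assumed to exist already:

```python
import boto3

sns = boto3.client("sns")
lam = boto3.client("lambda")

TOPIC_ARN = "arn:aws:sns:us-east-1:111111111111:uploads"    # placeholder
QUEUE_ARN = "arn:aws:sqs:us-east-1:222222222222:uploads-q"  # placeholder
FUNCTION = "process-upload"                                 # placeholder

# 1. Subscribe the queue to the topic instead of subscribing the function.
sns.subscribe(
    TopicArn=TOPIC_ARN,
    Protocol="sqs",
    Endpoint=QUEUE_ARN,
    Attributes={"RawMessageDelivery": "true"},
)

# 2. Let Lambda poll the queue, with batching and a concurrency cap, so a
#    burst of 500 notifications does not mean 500 simultaneous invocations.
lam.create_event_source_mapping(
    EventSourceArn=QUEUE_ARN,
    FunctionName=FUNCTION,
    BatchSize=10,
    MaximumBatchingWindowInSeconds=60,   # fine given the ~5 minute tolerance
    ScalingConfig={"MaximumConcurrency": 5},
)
```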
I have an S3 bucket with different files. I need to read those files and publish an SQS message for each row in each file.
I cannot use S3 events directly, as the files need to be processed with a delay: put onto SQS a month after upload.
I can write a scheduler to do this task (read and publish), but can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date (the filename) of the data to be read and published.
Edit: the data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions, depending on the overall message throughput and on whether the S3 uploads occur in bursts or in a smooth distribution.
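Here is a rough sketch of the second approach, with a made-up table name, attribute names, and queue URL; note that TTL deletion is not exact and can lag the expiry time, and the table's stream must include old images:

```python
import json
import time

import boto3

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

TABLE = "pending-files"                                               # hypothetical
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111111111111/rows"   # hypothetical


def on_s3_upload(event, context):
    """Triggered by the S3 event: remember each file with a TTL a month out."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "s3_key": {"S": key},
                # 'expires_at' must be configured as the table's TTL attribute.
                "expires_at": {"N": str(int(time.time()) + 30 * 24 * 3600)},
            },
        )


def on_ttl_expiry(event, context):
    """Triggered by the DynamoDB stream: act only on TTL-driven deletes."""
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue
        # TTL deletes are performed by the DynamoDB service principal; skip
        # manual deletes so they don't get published by accident.
        if record.get("userIdentity", {}).get("principalId") != "dynamodb.amazonaws.com":
            continue
        key = record["dynamodb"]["OldImage"]["s3_key"]["S"]
        # Here I only forward the key; reading the file and publishing one
        # message per row would happen next, in this function or downstream.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"key": key}))
```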
At the core, you need to do two things:
Enumerate all of the objects in an S3 bucket, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to handle being interrupted and continue paging through files where a previous run left off.
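For the simple end of that spectrum, here is a minimal sketch of the daily Lambda, assuming a bucket name and queue URL of my own invention; the paginator deals with continuation tokens within one run, but for billions of objects you would still have to persist progress across runs:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "incoming-data"                                                  # hypothetical
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111111111111/old-files"  # hypothetical


def handler(event, context):
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                continue
            # One message per month-old object; the consumer then reads the
            # file and fans out a message per row.
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"bucket": BUCKET, "key": obj["Key"]}),
            )
```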
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro, if you just need to do some simple work on a few dozen files every day.
It all depends on your workload and needs.
We're at the beginning stages of writing an AWS Lambda function to copy massive numbers of files within S3.
This Lambda will be triggered from S3.
Any advice about the maximum number of triggers Lambda can handle at one time? For example, if we dump 10,000 trigger files in the S3 trigger folder, will Lambda handle this pretty well, or will it throttle itself back enough to slow the whole thing down? Would there be a better recommended number?
I have noticed that direct triggers to Lambda do not scale well with a high number of concurrently copied objects (it works with SNS in between; see below). There are other limits that turn this into a different issue: Lambda concurrency limits and Lambda cold starts.
One way we got the triggers to work was to send them to SNS and forward them to Lambda from there. The notifications are queued and delivered by SNS. You will see some extra latency from the round trip through SNS.
Hope it helps.
EDIT1:
On the other hand, if you still want to retain the direct trigger and don't mind the extra time, it worked fine for us with 500 milliseconds between each successful copy, so that it does not fire all the triggers at once.
That was what we tested first, and it worked, but it took longer. With SNS, no such throttling is required.
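For reference, the S3 -> SNS -> Lambda wiring itself is just configuration. Here is a boto3 sketch with made-up names; the SNS topic policy that lets S3 publish, and the Lambda permission that lets SNS invoke the function, are assumed to be in place:

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

BUCKET = "copy-trigger-bucket"                                        # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:111111111111:copy-events"          # placeholder
LAMBDA_ARN = "arn:aws:lambda:us-east-1:111111111111:function:copier"  # placeholder

# S3 publishes object-created events to the topic instead of invoking Lambda directly.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "TopicConfigurations": [{
            "TopicArn": TOPIC_ARN,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "trigger/"},  # the trigger folder
            ]}},
        }]
    },
)

# SNS then fans the notifications out to the function.
sns.subscribe(TopicArn=TOPIC_ARN, Protocol="lambda", Endpoint=LAMBDA_ARN)
```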
I've got this flow on AWS:
Put a file on S3 -> trigger -> Lambda function that inserts an item into DynamoDB -> see that I actually got the new item in DynamoDB.
While I'm uploading a few files (about 5-10) to S3, which triggers the Lambda calls, it takes a while before I see the expected results inside my DynamoDB table.
It seems like there is a queue being handled behind the scenes of the S3 trigger, because when I upload a few more files, the ones that hadn't shown up before now appear as items in DynamoDB.
My expected result is to see a new item in DynamoDB for each file uploaded to S3 within a second of the upload.
Is there a way to handle this with some configuration?
I think the above scenario is related to Lambda's concurrent execution behaviour, since you are uploading 5-10 files at once.
Every Lambda function is allocated with a fixed amount of specific resources regardless of the memory allocation, and each function is allocated with a fixed amount of code storage per function and per account.
The per-region AWS Lambda account limit for concurrent executions is 1,000 by default.
See the Limits page in the Lambda documentation, under "Concurrent Executions - Event-Based Sources (e.g. Amazon S3)".
You can use the following formula to estimate your concurrent Lambda function invocations:
events (or requests) per second * function duration
For example, consider a Lambda function that processes Amazon S3 events. Suppose that the Lambda function takes on average three seconds and Amazon S3 publishes 10 events per second. Then, you will have 30 concurrent executions of your Lambda function.
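The same estimate in code, using the numbers from that example:

```python
def estimated_concurrency(events_per_second, avg_duration_seconds):
    # concurrent executions ~= incoming event rate * average function duration
    return events_per_second * avg_duration_seconds

print(estimated_concurrency(10, 3))  # 30 concurrent executions
```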
To increase the limit, refer to the "To request a limit increase for concurrent executions" section of the Limits page linked above.
AWS may automatically raise the concurrent execution limit on your behalf to enable your function to match the incoming event rate, as in the case of triggering the function from an Amazon S3 bucket.