I have a project with the following workflow:
Pull files from server and upload to S3
When files hit S3, a message is sent to a topic using SNS
The Lambda function subscribed to that topic will then process the files by doing calculations
So far, I have not experienced any issues, but I wonder if this is a use case for SQS?
How many messages can my Lambda function handle all at once? If I am pulling hundreds of files, is SQS a necessity?
By default, the concurrent invocation limit is set to 1,000.
You can raise that limit, but I have never hit that number so far.
As soon as a Lambda finishes handling its current request, it is reused for another, so if you upload 1,000 files you will probably only need about 100 concurrent Lambdas, unless a single invocation takes minutes to run.
AWS handles the queued triggers, so even if you upload 100,000 files, they will be consumed as soon as possible, depending on various criteria.
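If you want to check or change that limit programmatically, here is a minimal sketch with boto3; the function name "process-uploaded-file" is just a placeholder, not anything from your setup:

```python
import boto3

lambda_client = boto3.client("lambda")

# Account-wide concurrent execution limit (1,000 by default).
settings = lambda_client.get_account_settings()
print(settings["AccountLimit"]["ConcurrentExecutions"])

# Optionally reserve part of that limit for one function so a burst of
# S3/SNS events cannot starve the other Lambdas in the account.
lambda_client.put_function_concurrency(
    FunctionName="process-uploaded-file",  # placeholder name
    ReservedConcurrentExecutions=200,
)
```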
You can test this by creating many small files and uploading them all at once :)
For higher speed, upload them to a different bucket first and then simply move them from bucket to bucket (copies within S3 are faster this way).
gl !
We have an external service that uploads files to our S3 bucket in account A. We raise SNS notifications on each upload. A lambda function in account B subscribes to these notifications.
This works well for us, except that if the external service misses a configuration, it uploads >500 files together (in a single directory), and our Lambda is triggered 500 times when that happens.
1. Is there a way to limit the number of files uploaded to a bucket within X minutes?
2. Is there a way to stop the lambda from getting invoked if it sees >500 SNS notifications together?
I am aware that placing an SQS queue between SNS and the Lambda will probably solve our problem. I want to know if there is another, easier, more convenient way to solve this.
I explored the possibility of limiting the Lambda concurrency so that it fails on throttling; however, SNS notifications are retried three times (which is also a good thing, and we don't want to lose this feature in case of other errors), so we do not want to do that.
Note that instant processing is not a hard requirement for us. We can wait for around 5 minutes to process the SNS notification.
No, it is not possible to limit uploads to Amazon S3 within a given time period.
Nor is it possible to stop Lambda being invoked if it sees more than a given quantity of Amazon SNS notifications.
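For completeness, since the question already names the SNS → SQS → Lambda pattern as the known fix, a rough sketch of wiring it up with boto3 could look like the following. Every name and ARN here is a placeholder, and the 300-second batching window is chosen to match the "we can wait around 5 minutes" constraint:

```python
import json
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")
lam = boto3.client("lambda")

topic_arn = "arn:aws:sns:us-east-1:111111111111:upload-topic"  # placeholder

# Create a queue and allow the SNS topic to publish to it.
queue_url = sqs.create_queue(QueueName="upload-events")["QueueUrl"]  # placeholder
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"Policy": json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })},
)
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# Let Lambda poll the queue in batches instead of being invoked per message.
lam.create_event_source_mapping(
    EventSourceArn=queue_arn,
    FunctionName="process-upload",  # placeholder
    BatchSize=100,
    MaximumBatchingWindowInSeconds=300,  # up to 5 minutes of batching
)
```

With this wiring, 500 near-simultaneous uploads arrive as a handful of batched invocations rather than 500 separate ones.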
I have an S3 bucket with different files. I need to read those files and publish an SQS message for each row in the file.
I cannot use S3 events directly, as the files need to be processed with a delay: put on SQS after a month.
I can write a scheduler to do this task (read and publish), but can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date (filename) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
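To make the second option concrete, here is a minimal sketch under these assumptions (none of which come from the answer above): a table named "delayed-files" with TTL enabled on an "expires_at" attribute and a stream that includes old images, plus a destination queue URL in the QUEUE_URL environment variable:

```python
import json
import os
import time

import boto3

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

TABLE = "delayed-files"              # placeholder table name
QUEUE_URL = os.environ["QUEUE_URL"]  # placeholder queue URL


def on_s3_upload(event, context):
    """Triggered by S3: record each new object with a TTL one month ahead."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "pk": {"S": f"{bucket}/{key}"},
                "expires_at": {"N": str(int(time.time()) + 30 * 24 * 3600)},
            },
        )


def on_ttl_expiry(event, context):
    """Triggered by the DynamoDB stream: forward expired items to SQS."""
    for record in event["Records"]:
        # TTL deletions arrive as REMOVE events; OldImage holds the expired item.
        if record["eventName"] != "REMOVE":
            continue
        item = record["dynamodb"]["OldImage"]
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item["pk"]["S"]))
```

One design note: DynamoDB removes expired items on a best-effort background schedule, so the actual delay can run somewhat past the one-month mark.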
At the core, you need to do two things:
Enumerate all of the objects in the S3 bucket, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and on what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted and continue paging through files from a previous run.
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work to a few dozen files every day.
It all depends on your workload and needs.
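As a concrete illustration of that once-a-day Lambda, a sketch with boto3 might look like this (the bucket name and queue URL are placeholders; it simply enqueues the key of anything older than 30 days):

```python
import os
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-data-bucket"            # placeholder
QUEUE_URL = os.environ["QUEUE_URL"]  # placeholder
CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)


def handler(event, context):
    # Page through the whole bucket; large buckets may need checkpointing.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < CUTOFF:
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])
```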
I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it has worked fine: when I upload a bunch of text files to the S3 bucket, a bunch of Lambdas run at the same time and process each text file.
But my issue is that occasionally, 1 or 2 files (out of 20k or so in total) do not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it's NOT that the file was processed by the Lambda and failed; the logs showed that the Lambda was not triggered by those 1 or 2 files at all. I don't believe it's hitting the 1,000 concurrent Lambda limit either, since my function runs fast and the peak is around 200 concurrent Lambdas.
My question is: is this because AWS Lambda does not guarantee it will be triggered 100% of the time? Like S3, is there always an (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
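If you follow the docs' suggestion to audit for missed events, a hedged sketch could look like the following; the was_processed() helper is hypothetical and stands in for whatever record (a database table, a DynamoDB item, etc.) your pipeline keeps of completed files:

```python
import boto3

s3 = boto3.client("s3")


def was_processed(key: str) -> bool:
    """Hypothetical lookup against your own processing records."""
    raise NotImplementedError


def find_missed_objects(bucket: str):
    # List everything in the bucket and yield keys with no processing record.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if not was_processed(obj["Key"]):
                yield obj["Key"]
```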
I have one Lambda that is executed on an S3 put object trigger.
Now, whenever any object is uploaded to S3, the Lambda is triggered.
Let's say someone uploads 5 files to S3; the Lambda will then execute once for each of the 5 files.
Is there any way the Lambda can be triggered only one time for all those 5 files?
Also, after those 5 triggers/Lambda executions complete, can I trace how many minutes the Lambda has not been executing because no files were uploaded?
Any help would be really appreciated.
If you have the bucket notification configured for object creation (s3:ObjectCreated) and you haven't specified a filter, or the filter matches the uploaded objects, your Lambda will be triggered once per uploaded object.
To see the number of invocations, you can look at the Invocations metric for your Lambda function in CloudWatch Metrics.
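For example, a minimal boto3 sketch that sums the Invocations metric in 5-minute windows over the last day (the function name is a placeholder); the empty windows are the periods when no files were uploaded:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "my-s3-handler"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,                 # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```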
You may want to add a queue that will handle the requests to process new files on S3.
A relevant choice could be a Kinesis data stream or SQS. If batching is important to you, Kinesis will probably be better.
The requests can be sent by a Lambda triggered by S3 as you described, but that Lambda would only send the request to the queue, and another Lambda would then process it. A simpler way would be to send the request from the same code that puts the object in S3 (if possible).
This way, you can have statistics on how many requests were sent, processed, are waiting, etc.
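A sketch of that simpler variant (boto3, with placeholder bucket and queue names): the uploader writes the object and enqueues the processing request itself, so no S3 trigger is needed:

```python
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")


def upload_and_enqueue(path: str, bucket: str, key: str, queue_url: str):
    # Upload the file, then tell the processing queue about it in one place.
    s3.upload_file(path, bucket, key)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"bucket": bucket, "key": key}),
    )
```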
We're at the beginning stages of writing an AWS Lambda to copy massive numbers of S3 files within S3.
This Lambda will be triggered from S3.
Any advice about the maximum number of triggers Lambda can handle at one time? For example, if we dump 10,000 trigger files in the S3 trigger folder, will Lambda handle this pretty well, or will it throttle itself back enough to slow the whole thing down? Would there be a better recommended number?
I have noticed that direct triggers to Lambda are not scalable with a high number of concurrently copied objects (it works with SNS). There are other limits that turn this into a different issue: Lambda concurrency limits and Lambda cold starts.
One way we got the triggers to work is to send the events to SNS and forward them to Lambda from there. They will be queued and delivered by SNS. You will see some latency from the round trip through SNS.
Hope it helps.
EDIT1:
On the other hand, if you still want to retain the direct trigger and don't mind the extra time: it worked fine with a 500-millisecond pause between each successful copy, so that not all the triggers fire at once.
This is what we tested first, and it worked successfully but took longer. With SNS, no throttling was required.
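For illustration, a sketch of that pacing with boto3 (the bucket names are placeholders):

```python
import time

import boto3

s3 = boto3.client("s3")


def paced_copy(keys, src_bucket, dst_bucket):
    # Copy objects one at a time with a 500 ms pause so the destination
    # bucket's notifications are spread out instead of firing all at once.
    for key in keys:
        s3.copy_object(
            Bucket=dst_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        time.sleep(0.5)
```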