I've got a bucket that will receive a random amount of files within an unknown timeframe.
This could be anything from 1 file in 5 hours to 1000 files within 1 minute...
I want to invoke a Lambda function when the bucket has new files, but I don't really care about the content of the S3 event the Lambda gets passed. Is there something that will allow me to call the Lambda a single time if there are new files within the last 10 minutes, without setting up something cron-like that runs every 10 minutes and checks for new files? I really only want to execute this a single time, and only if there are new files.
You can create a CloudWatch Alarm that monitors the Amazon S3 request metrics and fires whenever the number of HTTP PUT requests made for objects in the bucket is greater than zero within a period of ten minutes.
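As a sketch, that alarm could be created with boto3 along these lines. Note this assumes you have enabled S3 request metrics on the bucket, since the `PutRequests` metric is only published once a metrics filter (such as one covering the entire bucket) exists; the bucket, alarm, topic ARN and filter ID below are placeholders:

```python
# Sketch: CloudWatch alarm on the S3 PutRequests request metric.
# Assumes request metrics are enabled on the bucket under a filter named
# "EntireBucket" (a common choice, but yours may differ).

def put_alarm_params(bucket: str, topic_arn: str) -> dict:
    """Build the kwargs for cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"{bucket}-new-objects",
        "Namespace": "AWS/S3",
        "MetricName": "PutRequests",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": "EntireBucket"},
        ],
        "Statistic": "Sum",
        "Period": 600,                       # 10-minute window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],         # SNS topic the Lambda subscribes to
        "TreatMissingData": "notBreaching",  # no PUTs -> no data -> stay OK
    }

if __name__ == "__main__":
    params = put_alarm_params(
        "my-bucket", "arn:aws:sns:eu-west-1:123456789012:new-files"
    )
    print(params["MetricName"])
    # boto3.client("cloudwatch").put_metric_alarm(**params)  # real call, needs AWS creds
```

A nice property here is that the alarm only transitions from OK to ALARM once until the metric recovers, which matches the "invoke it a single time" requirement; subscribing the Lambda to the SNS topic used as the alarm action closes the loop.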
A batch job needs to process over 4,000 files stored in an S3 bucket. The files are spread across 36 different prefixes, and each of those prefixes has 4 sub-prefixes.
Basically, 36 root folders, each with 4 subfolders, recreated in the S3 bucket using prefixes. The location of each file is stored in a database, which will be read by a Lambda.
I am planning to use Lambda here and want to process 6 root folders (S3 prefixes) in each of 6 concurrent Lambdas, meaning I want to run the same Lambda concurrently.
I want to create an initializer Lambda that reads the configuration of this folder structure from RDS and pushes 6 different SQS messages, each containing the names of 6 root folders (S3 prefixes). I will create another Lambda, called the processing Lambda, which reads an SQS message and processes its 6 folders sequentially within the Lambda.
Below are my questions:
With 6 SQS messages, will the processing Lambda be executed concurrently?
Assuming an average file size of around 50 KB, the Lambda will have to read each of these files into memory; will this be an issue for processing in Lambda?
The SLA for the job is around 10 minutes, and the batch job is triggered every 20 minutes from 8am to 8pm. Is Lambda the right option, or should this be done using ECS?
You might have a look at the distributed map offered by Step Functions. I think it perfectly fits your need.
https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/
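As a rough illustration of the idea (the bucket, prefix and Lambda ARN below are made up, and this is a sketch rather than a ready-to-deploy definition), a Distributed Map state that fans out over objects listed from S3 might look like:

```json
{
  "Comment": "Sketch only - fan out over S3 objects with a Distributed Map",
  "StartAt": "ProcessPrefixes",
  "States": {
    "ProcessPrefixes": {
      "Type": "Map",
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": { "Bucket": "my-batch-bucket", "Prefix": "root-folders/" }
      },
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
        "StartAt": "ProcessOneItem",
        "States": {
          "ProcessOneItem": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:process-file",
            "End": true
          }
        }
      },
      "MaxConcurrency": 6,
      "End": true
    }
  }
}
```

`MaxConcurrency` caps the fan-out, which maps nicely onto the "6 concurrent workers" requirement without any initializer Lambda or SQS plumbing.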
I have one Lambda that is executed by an S3 put-object trigger.
Now, whenever any object is uploaded to S3, the Lambda is triggered.
Say someone uploads 5 files to S3: the Lambda will be executed once for each of those 5 files.
Is there any way for the Lambda to trigger only one time for all those 5 files?
And after the 5 triggers/Lambda executions complete, can I trace how many minutes the Lambda has not executed because no files were uploaded?
Any help would be really appreciated.
If you have the bucket notification configured for object creation (s3:ObjectCreated) and you either haven't specified a filter or the filter matches the uploaded objects, your Lambda will get triggered once per uploaded object.
To see the number of invocations, you can look at the Invocations metric for your Lambda function in CloudWatch metrics.
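For example, a sketch of pulling that metric with boto3 (the function name is a placeholder, and the commented-out call needs AWS credentials):

```python
# Sketch: query the AWS/Lambda Invocations metric for the last hour.
from datetime import datetime, timedelta, timezone

def invocations_query(function_name: str) -> dict:
    """Build the kwargs for cloudwatch.get_metric_statistics()."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": now - timedelta(hours=1),
        "EndTime": now,
        "Period": 300,          # 5-minute buckets
        "Statistics": ["Sum"],
    }

if __name__ == "__main__":
    q = invocations_query("my-s3-handler")
    print(q["MetricName"])
    # datapoints = boto3.client("cloudwatch").get_metric_statistics(**q)
```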
You may want to add a queue that handles the requests to process new files on S3.
Kinesis Data Streams or SQS would both be relevant; if batching is important to you, Kinesis will probably be better.
The requests can be sent by a Lambda triggered by S3, as you described, but that Lambda would only send the request to the queue, and another Lambda would then process it. A simpler way would be to send the request from the same code that puts the object in S3 (if possible).
This way you can have statistics on how many requests were sent, processed, waiting, etc.
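For the SQS variant, it's worth noting that batching is available there too: the queue-to-Lambda event source mapping can deliver many messages per invocation once a batching window is set. A sketch (the ARNs and names are placeholders; the real call needs boto3 and credentials):

```python
# Sketch: wire an SQS queue to a consumer Lambda with batching, so one
# invocation can receive up to 100 queued "new file" requests at once.
mapping_params = {
    "EventSourceArn": "arn:aws:sqs:eu-west-1:123456789012:new-file-requests",
    "FunctionName": "process-new-files",
    "BatchSize": 100,                      # up to 100 messages per invocation
    "MaximumBatchingWindowInSeconds": 60,  # or wait up to 60s to fill a batch
}
# boto3.client("lambda").create_event_source_mapping(**mapping_params)
print(mapping_params["BatchSize"])
```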
I have a use case where some process puts a file into an S3 bucket every 6 hours. The bucket already has thousands of files in it, and I want to generate an SNS alert (or something similar) if no new file has been added in the last 7 hours. What would be a reasonable approach?
Thanks
There are a few potential approaches:
Check the bucket every few minutes
Keep track of the last new file
Use an Amazon CloudWatch Alarm
Check the bucket every few minutes
Configure Amazon CloudWatch Events to trigger an AWS Lambda function every few minutes (depending upon how quickly you want it reported), which obtains a listing of the bucket and checks the timestamp of the most recently added object. If it is more than 7 hours old, send the alert.
This approach is very simple but is doing a lot of work every few minutes, including during the 7 hours after an object was added. Plus, if you have lots of objects, this can consume a lot of Lambda time and API calls.
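A minimal sketch of that checker, with the staleness logic kept separate from the AWS calls (the bucket name and topic ARN in the comments are placeholders):

```python
# Sketch of the "check every few minutes" Lambda: scan the bucket listing,
# find the newest LastModified, and alert if it is older than 7 hours.
from datetime import datetime, timedelta, timezone

SEVEN_HOURS = timedelta(hours=7)

def newest_upload(objects):
    """Most recent LastModified among S3 object dicts (as list_objects_v2 returns)."""
    return max(o["LastModified"] for o in objects)

def is_stale(objects, now, max_age=SEVEN_HOURS):
    """True if the bucket is empty or its newest object is older than max_age."""
    return not objects or now - newest_upload(objects) > max_age

# Inside the Lambda handler you would page through the bucket and alert:
# s3, sns = boto3.client("s3"), boto3.client("sns")
# pages = s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket")
# objects = [o for page in pages for o in page.get("Contents", [])]
# if is_stale(objects, datetime.now(timezone.utc)):
#     sns.publish(TopicArn="arn:aws:sns:eu-west-1:123456789012:stale-bucket",
#                 Message="No new files in the last 7 hours")
```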
Keep track of the last new file
Configure an Event on the Amazon S3 bucket to trigger an AWS Lambda function whenever a new file is added to the bucket. Store the current time in a DynamoDB table (or, if you really want to save costs, store it in the Systems Manager Parameter Store or an S3 object in another bucket). This will update the date whenever a new file is added.
Configure Amazon CloudWatch Events to trigger an AWS Lambda function every few minutes (depending upon how quickly you want it reported) that checks the "last updated date" in DynamoDB (or wherever it was stored). If it is more than 7 hours old, trigger an alert.
While this approach has more components, it is actually a simpler solution because it never has to look through the list of objects in S3. Instead, it just remembers when the last object was added.
You could come up with an even smarter method that, instead of checking every few minutes, schedules an alert function in 7 hours time. Whenever a new file is added, it changes the schedule to put it 7 hours away again. It's like constantly delaying a dentist appointment. :)
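A sketch of the two Lambdas in this approach, with the DynamoDB table injected so the logic can be followed (and exercised) without AWS. The table, key and attribute names are made up:

```python
# Sketch of the "keep track of the last new file" pair of Lambdas.
import time

KEY = {"pk": "last-upload"}

def record_upload(table, now=None):
    """S3-triggered Lambda: remember when the latest object arrived."""
    ts = int(time.time() if now is None else now)
    table.put_item(Item={**KEY, "seen_at": ts})

def hours_since_upload(table, now=None):
    """Scheduled Lambda: how long since the last object arrived?"""
    now = time.time() if now is None else now
    item = table.get_item(Key=KEY)["Item"]
    return (now - item["seen_at"]) / 3600

class FakeTable:
    """Stands in for boto3.resource("dynamodb").Table("uploads")."""
    def __init__(self):
        self.item = None
    def put_item(self, Item):
        self.item = Item
    def get_item(self, Key):
        return {"Item": self.item}

if __name__ == "__main__":
    t = FakeTable()
    record_upload(t, now=0)
    print(hours_since_upload(t, now=8 * 3600))  # 8.0 -> older than 7h, alert
```

The checking Lambda would alert (e.g. via SNS) whenever `hours_since_upload(...) > 7`.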
Use an Amazon CloudWatch Alarm
This is a simpler method that uses a CloudWatch Alarm to trigger the notification.
Configure the S3 bucket to trigger a Lambda function whenever an object is added. The Lambda function sends a Custom Metric to Amazon CloudWatch.
Create a CloudWatch Alarm to trigger a notification whenever the SUM of the Custom Metric is zero for the past 6 hours. Also configure it to trigger if the Alarm enters the INSUFFICIENT_DATA state, so that it correctly triggers when no data is sent (which is more likely than a metric of zero since the Lambda function won't send data when no objects are created).
The only downside is that the alarm period only has a few options. It can be set for 6 hours, but I don't think it can be set for 7 hours.
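The metric and alarm for this approach might be sketched like so (the namespace, metric and alarm names are invented for the example, and the commented-out boto3 calls are where the real work happens):

```python
# Sketch: the S3-triggered Lambda publishes a custom metric, and an alarm
# watches its 6-hour Sum with missing data treated as breaching.

metric_datum = {"MetricName": "ObjectsCreated", "Value": 1, "Unit": "Count"}
# boto3.client("cloudwatch").put_metric_data(Namespace="MyApp/Uploads",
#                                            MetricData=[metric_datum])

alarm_params = {
    "AlarmName": "no-uploads-6h",
    "Namespace": "MyApp/Uploads",
    "MetricName": "ObjectsCreated",
    "Statistic": "Sum",
    "Period": 21600,                  # 6 hours
    "EvaluationPeriods": 1,           # Period * EvaluationPeriods <= 24h
    "Threshold": 0,
    "ComparisonOperator": "LessThanOrEqualToThreshold",
    "TreatMissingData": "breaching",  # no data points at all -> still alarm
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:stale-bucket"],
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print(alarm_params["Period"] * alarm_params["EvaluationPeriods"] <= 86400)  # True
```

`TreatMissingData: "breaching"` covers the "no data sent" case described above without needing a separate INSUFFICIENT_DATA action.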
How to alert
As to how to alert somebody, sending a message to an Amazon SNS topic is a good idea. People could subscribe via Email, SMS and various other methods.
The Amazon CloudWatch Alarm method described by @John Rotenstein is definitely the simplest option for most use cases and works well. Just one thing to be aware of: CloudWatch Alarms have a 24-hour limit per metric (EvaluationPeriods * Period must be <= 86400s). Therefore, if you expect your bucket to receive files less than once per day, you'll need to use a different method.
I have a task where, on a scheduled basis, I need to check the number of files in a bucket (the files are uploaded via a NAS) and then e-mail the total count using SES.
The e-mail part on its own is working fine. However, since I have over 40,000 files in the bucket, it takes 5 minutes or more to return the total file count.
From a design perspective, is it better to put this part of the logic on an EC2 machine and schedule the action there? Or are there better ways to do this?
Note, I don't have to list all the files. I simply want to get a total count of all the files in the bucket.
How about having a Lambda triggered every time a file is put, deleted, etc., and, according to the event received, the Lambda updates a DynamoDB table that stores the count.
e.g.
When a file is added to S3, the Lambda increases the count in the DynamoDB table by 1, and when a file is deleted, the Lambda decreases the count.
This way, I guess, you will always have the latest count without even counting the files.
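That counter Lambda might be sketched as follows (the table, key and attribute names are placeholders; the `ADD` update expression is what makes the increment atomic under concurrent invocations):

```python
# Sketch: map the S3 event name to +1/-1 and apply an atomic ADD update.

def delta_for(event_name: str) -> int:
    """ObjectCreated:* -> +1, ObjectRemoved:* -> -1, anything else -> 0."""
    if event_name.startswith("ObjectCreated"):
        return 1
    if event_name.startswith("ObjectRemoved"):
        return -1
    return 0

def handler(event, table):
    for record in event["Records"]:
        d = delta_for(record["eventName"])
        if d:
            table.update_item(
                Key={"pk": "file-count"},
                UpdateExpression="ADD cnt :d",      # atomic increment/decrement
                ExpressionAttributeValues={":d": d},
            )

# In the real Lambda: table = boto3.resource("dynamodb").Table("bucket-stats")
```

The e-mailing job then becomes a single `get_item` instead of a bucket listing.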
You did not mention how often you need to do this file count.
If it is daily or less often, you can activate Amazon S3 Inventory. It can provide a daily dump of all files in a bucket, from which you could perform a count.
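Once an inventory report lands, counting the files really is just counting rows: S3 Inventory CSV files contain no header row. A sketch (the column layout shown is just one possible field selection):

```python
# Sketch: count the data rows in an S3 Inventory CSV listing.
import csv
import io

def count_inventory_rows(csv_text: str) -> int:
    """Each row of an inventory CSV describes one object; count them."""
    return sum(1 for _ in csv.reader(io.StringIO(csv_text)))

sample = "my-bucket,photos/a.jpg,52100\nmy-bucket,photos/b.jpg,49930\n"
print(count_inventory_rows(sample))  # 2
```

In practice the inventory is delivered as one or more gzipped CSV objects listed in a manifest, so the real job would download and concatenate those before counting.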
I've got this flow on AWS:
Put file on S3 -> trigger -> Lambda function that inserts an item into DynamoDB -> see that I actually got a new item in DynamoDB
While I'm uploading a few files (about 5-10) to S3, which triggers the Lambda calls, it takes time to see the expected results in my DynamoDB table.
It seems like there is a queue being handled behind the scenes of the S3 trigger, because when I upload a few more files, the ones that weren't visible before now appear as items in DynamoDB.
My expected result is to see a new item in DynamoDB for each file uploaded to S3, within the second it was made.
Is there a way to handle this issue using any configuration?
I think the above scenario is related to Lambda's "Concurrent Execution" limits, as you are trying to upload 5-10 files.
Every Lambda function is allocated with a fixed amount of specific resources regardless of the memory allocation, and each function is allocated with a fixed amount of code storage per function and per account.
AWS Lambda account limit per region = 100 concurrent executions by default (a default that has since been raised to 1,000).
Limits
Concurrent Executions - Refer Event Based Sources (e.g. S3)
You can use the following formula to estimate your concurrent Lambda function invocations:
events (or requests) per second * function duration
For example, consider a Lambda function that processes Amazon S3 events. Suppose that the Lambda function takes on average three seconds and Amazon S3 publishes 10 events per second. Then, you will have 30 concurrent executions of your Lambda function.
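That back-of-the-envelope formula in code, reproducing the numbers from the example above:

```python
# Estimated Lambda concurrency = event rate * average function duration.

def estimated_concurrency(events_per_second: float, avg_duration_s: float) -> float:
    return events_per_second * avg_duration_s

# The S3 example from the text: 10 events/s * 3 s average duration.
print(estimated_concurrency(10, 3))  # 30
```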
To increase the limit:-
Refer "To request a limit increase for concurrent executions" section in above link.
AWS may automatically raise the concurrent execution limit on your behalf to enable your function to match the incoming event rate, as in the case of triggering the function from an Amazon S3 bucket.