I have a task where, on a scheduled basis, I need to check the number of files in a bucket (files are uploaded via a NAS) and then e-mail the total using SES.
The e-mail part on its own works fine. However, since I have over 40,000 files in the bucket, it takes 5 minutes or more just to return the total file count.
From a design perspective, is it better to put this part of the logic on an EC2 instance and schedule the action there? Or are there better ways to do this?
Note: I don't need to list all the files. I simply want a total count of the files in the bucket.
How about having a Lambda function triggered every time a file is put, deleted, etc.,
and, according to the event received, the Lambda updates a DynamoDB table that stores the count.
e.g.
If a file is added to S3, the Lambda increases the count in the DynamoDB table by 1,
and if a file is deleted, the Lambda decreases the count.
This way, I guess, you will always have the latest count without ever having to count the files.
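A minimal sketch of such a handler, assuming a DynamoDB table named file-stats (a hypothetical name) with a partition key pk and a numeric count attribute:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("file-stats")  # hypothetical table name

def handler(event, context):
    # One S3 notification can carry multiple records.
    for record in event["Records"]:
        event_name = record["eventName"]  # e.g. "ObjectCreated:Put"
        if event_name.startswith("ObjectCreated"):
            delta = 1
        elif event_name.startswith("ObjectRemoved"):
            delta = -1
        else:
            continue
        # Atomic counter update; ADD creates the attribute if it is missing.
        # "count" is a DynamoDB reserved word, hence the expression name alias.
        table.update_item(
            Key={"pk": "total"},
            UpdateExpression="ADD #c :d",
            ExpressionAttributeNames={"#c": "count"},
            ExpressionAttributeValues={":d": delta},
        )
```

The scheduled SES job then only has to read a single DynamoDB item instead of listing 40,000 objects.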
You did not mention how often you need to do this file count.
If it is daily or less often, you can activate Amazon S3 Inventory. It can deliver a daily listing of all files in a bucket, from which you could derive the count.
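As a rough sketch, assuming a CSV-formatted inventory report delivered to a hypothetical inventory-bucket, the count can be derived by summing the rows of the data files listed in the delivery's manifest:

```python
import gzip
import json
import boto3

s3 = boto3.client("s3")

def count_inventory_objects(inventory_bucket, manifest_key):
    # The manifest lists the gzipped CSV data files of one inventory delivery.
    manifest = json.loads(
        s3.get_object(Bucket=inventory_bucket, Key=manifest_key)["Body"].read()
    )
    total = 0
    for data_file in manifest["files"]:
        body = s3.get_object(Bucket=inventory_bucket, Key=data_file["key"])["Body"].read()
        # One CSV row per object in the source bucket.
        total += gzip.decompress(body).decode("utf-8").count("\n")
    return total
```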
Related
I have a setup where I can send files to my folder in S3 one by one, but I would like to know if this can be done in a batch, since I would like to send many files in a single transaction so as not to exceed the Salesforce governor limits.
I'm new to AWS and found out that with the AWS SDK I can't get multiple S3 objects in one request.
I could loop the GET request, but that would take a long time in a single function.
I heard that Lambda can run multiple functions at once and that SQS could help me with that.
So how would you set up a Lambda and SQS system that sums all digits found in all files of an S3 bucket?
For example, if I have 6,000 files in a bucket: a first Lambda counts them and sends a message to SQS with the number of files; SQS then triggers a Lambda that runs until just before it times out and sends the sum of digits it found, together with the index of the last file it read, in a message to SQS; that message triggers the next Lambda, and so on until all files have been read and summed, at which point the last Lambda returns the total sum.
Maybe better: the first Lambda fires several parallel Lambdas that each, upon completion, add to a sum stored somewhere, and in the end the sum is returned to me. If this sounds logical.
I heard that Lambda can run multiple functions at once
Lambda can run multiple instances at once, but something needs to execute the functions and aggregate the results.
and that SQS could help me with that.
SQS can help with a lot of things, but in this case I don't see any reasonable use for it.
So how would you set up a... system that sums all digits found in all files of an S3 bucket?
If you have a LOT of data and you want to process it in parallel, the default choice would be an EMR / Spark cluster. To keep it simple, assuming your S3 data is in a reasonable (supported) format, I'd personally use Amazon Athena, which is basically a serverless analytics service.
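A minimal sketch of driving Athena from code with boto3, assuming a table (here files_table, hypothetical) has already been defined over the bucket's data and using a simple count query as a stand-in for whatever aggregation you need:

```python
import time
import boto3

athena = boto3.client("athena")

def run_query(sql, database, output_location):
    # Start the query; Athena writes result files to the given S3 location.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)

# Hypothetical database, table, and results bucket.
results = run_query(
    "SELECT COUNT(*) FROM files_table",
    database="mydb",
    output_location="s3://my-athena-results/",
)
```

Athena scans the S3 objects in parallel for you, which is exactly the fan-out the question is trying to build by hand with Lambda and SQS.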
I've got a bucket that will receive a random number of files within an unknown timeframe.
This could be anything from 1 file in 5 hours to 1,000 files within 1 minute...
I want to invoke a Lambda function when the bucket has new files, but I don't really care about the content of the S3 event the Lambda gets passed. Is there something that will let me call the Lambda a single time if there are new files within the last 10 minutes, without setting up something cron-like that runs every 10 minutes and checks for new files? I really only want to execute this a single time, and only if there are new files.
You can create a CloudWatch Alarm that monitors the Amazon S3 request metrics and fires whenever the number of HTTP PUT requests made for objects in the bucket is greater than zero within a period of ten minutes. Note that request metrics are not published by default; you first have to enable a metrics configuration on the bucket.
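A rough sketch of that setup with boto3, with the bucket name and SNS topic ARN as hypothetical placeholders:

```python
import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

# Enable request metrics for the whole bucket (required for PutRequests).
s3.put_bucket_metrics_configuration(
    Bucket="my-bucket",
    Id="EntireBucket",
    MetricsConfiguration={"Id": "EntireBucket"},
)

# Alarm whenever at least one PUT happened in the last 10 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="new-files-in-my-bucket",
    Namespace="AWS/S3",
    MetricName="PutRequests",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    Statistic="Sum",
    Period=600,  # 10 minutes
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:new-files"],  # hypothetical topic
)
```

The alarm action can go to an SNS topic that the Lambda subscribes to, so the function fires once per ten-minute window with activity rather than once per object.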
I have a script which generates 4 CSV files per day in an AWS S3 bucket. I am trying to create an alarm in CloudWatch that detects if, on any given day, fewer than 4 files were generated in that particular S3 bucket. I tried to create an alarm, and it surprisingly had Sum and other options, but no option to check for a specific number within a given time span (say, 24 hours).
P.S. I have seen the Average function in the alarm, but it does not give the daily average of objects created in the bucket.
Is it possible to create the alarm the way I need? I tried googling but didn't find an exact solution to this problem.
You can use the 'PutRequests' metric to create your alarm.
This metric provides the number of times the PUT API was called on the S3 bucket (in your case, 4).
Set the Statistic to Sum.
Set the Period to 1 day.
Set the Threshold to Lower than 4.
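For reference, a sketch of the same configuration with boto3 (the bucket name and filter ID are hypothetical, and request metrics must already be enabled on the bucket):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when fewer than 4 PUTs were recorded over a 1-day period.
cloudwatch.put_metric_alarm(
    AlarmName="daily-csv-count-low",
    Namespace="AWS/S3",
    MetricName="PutRequests",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    Statistic="Sum",
    Period=86400,  # 1 day
    EvaluationPeriods=1,
    Threshold=4,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # no data at all also means no files were uploaded
)
```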
Trying to sync a large (millions of files) S3 bucket from the cloud to local storage seems to be a troublesome process for most S3 tools, as virtually everything I've seen so far uses the GET Bucket operation, patiently fetching the whole list of files in the bucket, then diffing it against a list of local files, then performing the actual file transfer.
This looks extremely suboptimal. For example, if one could list only the files in a bucket that were created or changed since a given date, this could be done quickly, as the list of files to be transferred would include just a handful, not millions.
However, given that the answer to this question is still true, it's not possible to do so via the S3 API.
Are there any other approaches to doing periodic incremental backups of a given large S3 bucket?
On AWS S3 you can configure event notifications (e.g. s3:ObjectCreated:*) to request a notification when an object is created. These support SNS, SQS, and Lambda as destinations. So you can have an application that listens for the events and updates the statistics. You may also want to add a timestamp as part of the statistics. Then just "query" the result for a certain period of time and you will get your delta.
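A minimal sketch of wiring that up, assuming the Lambda function already exists and grants S3 permission to invoke it (all names here are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Route object-created and object-removed events to a Lambda function,
# which can then record each key with a timestamp for later delta queries.
s3.put_bucket_notification_configuration(
    Bucket="my-large-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:track-changes",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    },
)
```

The backup job can then ask the statistics store for "keys changed since the last run" and transfer only those, instead of listing millions of objects every time.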