Does AWS S3 have a concept of files being 'updated'?

I'd like to write a Lambda function that is triggered when files are added or modified in an s3 bucket and processes them and moves them elsewhere, clobbering older versions of the files.
I'm wondering if AWS Lambda can be configured to trigger when files are updated?
After reviewing the Boto3 documentation for S3, it looks like the only things that can happen in an S3 bucket are creations and deletions.
Additionally, the AWS documentation seems to indicate there is no way to trigger things on 'updates' to S3.
Am I correct in thinking there is no real concept of an 'update' to a file in S3 and that an update would actually be when something was destroyed and recreated? If I'm mistaken, how can I trigger a Lambda function when an S3 file is changed in a bucket?

No, there is no concept of updating a file on S3. A file on S3 is updated the same way it is uploaded in the first place: through a PUT object request. An S3 bucket notification configured to trigger on a PUT object request can execute a Lambda function.
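For reference, a minimal Boto3 sketch of wiring that notification up (the bucket name and function ARN below are placeholders, and the Lambda function also needs a resource policy allowing S3 to invoke it):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # Placeholder ARN of the processing function.
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload",
                # s3:ObjectCreated:* covers Put, Post, Copy, and
                # CompleteMultipartUpload, so re-uploads ("updates")
                # of an existing key fire the function too.
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)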

There is also newer functionality for S3 buckets: under Properties you can enable versioning for the bucket. If you then set an object-created trigger on S3 assigned to your Lambda function, it will execute every time you 'update' the same file, since each update is a new version.
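For completeness, enabling versioning with Boto3 looks roughly like this (bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Enable versioning so every overwrite of a key creates a new version.
s3.put_bucket_versioning(
    Bucket="my-bucket",  # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)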

Related

AWS S3 sync command from one bucket to another

I want to use the AWS S3 sync command to sync a large bucket with another bucket.
I found this answer saying that the files from the bucket are synced over the AWS backbone and are not copied to the local machine, but I can't find a reference anywhere in the documentation. Does anyone have proof of this behavior, or any formal documentation that explains how it works?
I tried to find something in the documentation, but found nothing there.
To learn more about the sync command, check the CLI docs; you can refer directly to the section named:
Sync from S3 bucket to another S3 bucket
The following sync command syncs objects to a specified bucket and prefix from objects in another specified bucket and prefix by copying S3 objects. An S3 object will require copying if one of the following conditions is true:
The S3 object does not exist in the specified bucket and prefix destination.
The sizes of the two S3 objects differ.
The last modified time of the source is newer than the last modified time of the destination.
Use the S3 replication capability if you only want to replicate the data that moves from bucket1 to bucket2.
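If replication fits your case, here is a rough Boto3 sketch of a bucket-to-bucket replication rule; the bucket names, rule ID, and role ARN are placeholders, versioning must be enabled on both buckets, and the role must let S3 read the source and write the destination:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="bucket1",  # placeholder source bucket
    ReplicationConfiguration={
        # Placeholder IAM role that S3 assumes to perform the copy.
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate all new objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::bucket2"},
            }
        ],
    },
)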

How to set up directory level triggers in AWS S3 for Lambda?

I have a directory structure as shown below
S3 Bucket
-logs/
    -product1_log.txt
    -product2_log.txt
-images/
-products/
There are a couple of directories in the S3 bucket, as shown above. Whenever a new file gets added to the logs folder, I have a Lambda function that updates the timestamp in my MongoDB.
Requirement
Trigger the Lambda function only when the logs folder gets updated; updates to other folders should not trigger the Lambda.
The exact same use case is described in the link below.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
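As a sketch, the prefix filter described on that page looks like this with Boto3 (the bucket name and function ARN are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                # Placeholder ARN of the MongoDB-updating function.
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:update-mongo-timestamp",
                "Events": ["s3:ObjectCreated:*"],
                # Only keys beginning with "logs/" trigger the function;
                # uploads to images/ or products/ are ignored.
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "logs/"}
                        ]
                    }
                },
            }
        ]
    },
)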

AWS S3: trigger Lambda on new files that are NOT temporary files

I want to launch a Lambda for any new complete file; the process is quite simple:
I upload files to S3.
For every new file in the directory, I launch a Lambda.
Unfortunately, I see that my Lambdas are invoked on _temporary/* files, which are files that are not fully uploaded to S3. What can I do?
Thanks!
There is no concept of a "partially uploaded" file in Amazon S3. Either the whole object is created, or the object is not created. Nor does Amazon S3 have a concept of "temporary" or _temporary/ files. If they exist, it is because your application is uploading those files.
When creating an Amazon S3 event, you can specify a Prefix. The event will only be triggered for objects matching that prefix.
Alternatively, you could add a line of code at the start of the AWS Lambda function that checks the Key of the object and exits if it does not need to perform any actions on the object.
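A minimal sketch of that guard, assuming the handler receives a standard S3 event (the process function is a placeholder for your own logic):

def lambda_handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]

        # Skip Hadoop/Spark-style temporary output paths.
        if key.startswith("_temporary/") or "/_temporary/" in key:
            continue

        process(key)


def process(key):
    # Placeholder for the real per-file work.
    print(f"processing {key}")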

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 object with its last access date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function which runs on a CloudWatch Event fired on each Get event. Finally, create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from object-level CloudTrail logs. This could be done with Athena, or a direct query to S3.
Create a separate index, in something like DynamoDB, which you update in your application on read activities.
Use a lifecycle policy on the S3 bucket / key prefix to archive or delete the objects after x days, as shown in the sketch after this list. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
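A minimal lifecycle sketch with Boto3 (the bucket name, prefix, and 90-day window are placeholder values):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stale-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": "reports/"},  # placeholder prefix
                # Note: the 90-day clock runs from upload time,
                # not from last access.
                "Expiration": {"Days": 90},
            }
        ]
    },
)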
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using them, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post describing what I found to be a very interesting and cost-optimized approach to this problem.
Here is the description from the AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket. The objects to be expired are identified using the following logic:
Capture the number of days (x) from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
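As a rough illustration of the tagging step above, an S3 Batch Operations job can be created with Boto3 along these lines; the account ID, ARNs, manifest location, and ETag are all placeholders:

import boto3

s3control = boto3.client("s3control")

s3control.create_job(
    AccountId="123456789012",  # placeholder account ID
    ConfirmationRequired=False,
    # Tag every object listed in the manifest with delete=True.
    Operation={
        "S3PutObjectTagging": {
            "TagSet": [{"Key": "delete", "Value": "True"}]
        }
    },
    # The manifest CSV is the object list produced by the Athena query.
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::manifest-bucket/stale-objects.csv",
            "ETag": "placeholder-manifest-etag",
        },
    },
    Report={
        "Enabled": True,
        "Bucket": "arn:aws:s3:::report-bucket",
        "Format": "Report_CSV_20180820",
        "ReportScope": "AllTasks",
        "Prefix": "batch-reports",
    },
    Priority=10,
    RoleArn="arn:aws:iam::123456789012:role/batch-operations-role",
)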

Run AWS lambda function on existing S3 images

I wrote an AWS Lambda function in Node.js for image resizing and trigger it when images are uploaded.
I already have more than 1,000,000 existing images in the bucket.
I want to run this Lambda function on those images, but haven't found a way to do it yet.
How can I run an AWS Lambda function on the existing images of an S3 bucket?
Note: I know this question has already been asked on Stack Overflow, but the issue is that none of the answers provide a solution yet.
Unfortunately, Lambda cannot be triggered automatically for objects that already exist in an S3 bucket.
You will have to invoke your Lambda function manually for each image in your S3 bucket.
First, you will need to list existing objects in your S3 bucket using the ListObjectsV2 action.
For each object in your S3 bucket, you must then invoke your Lambda function and provide the S3 object's information as the Payload.
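A sketch of that loop with Boto3; the bucket and function names are placeholders, and the payload mimics a minimal S3 event, so extend it if your handler reads other fields:

import json

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BUCKET = "my-image-bucket"          # placeholder bucket name
FUNCTION = "image-resize-function"  # placeholder function name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        payload = {
            "Records": [
                {
                    "s3": {
                        "bucket": {"name": BUCKET},
                        "object": {"key": obj["Key"]},
                    }
                }
            ]
        }
        # Asynchronous invocation, so the loop is not blocked
        # waiting for each resize to finish.
        lambda_client.invoke(
            FunctionName=FUNCTION,
            InvocationType="Event",
            Payload=json.dumps(payload),
        )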
Yes, it's completely true that Lambda cannot be triggered by objects already present in your S3 bucket, but invoking your Lambda manually for each object is far from the best idea.
With some clever techniques you can perform your tasks on those images easily:
The hard way: write a program locally that does exactly the same thing as your Lambda function, with two additions: it iterates over each object in your bucket, runs your code on it, and saves the result to the destination path in S3 after resizing. That is, for all images already stored in your S3 bucket, instead of using Lambda you resize the images locally on your computer and save them back to the S3 destination.
The easy way: first make sure you have configured the S3 notification event type Object Created (All) as the trigger for your Lambda.
Then move all your already-stored images to a new temporary bucket, and then move those images back to the original bucket; this way your Lambda gets triggered for each image automatically. You can do the moving easily with the SDKs provided by AWS, for example boto3 in Python.
Instead of moving (i.e. cut and paste), you can use copy and paste too; a sketch follows below.
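A rough Boto3 sketch of the copy-out-and-back approach (bucket names are placeholders; note that copy_object handles objects up to 5 GB, beyond which you would need a multipart copy):

import boto3

s3 = boto3.client("s3")

SRC = "my-image-bucket"  # placeholder original bucket
TMP = "my-temp-bucket"   # placeholder temporary bucket

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Copy out to the temporary bucket...
        s3.copy_object(
            Bucket=TMP, Key=key,
            CopySource={"Bucket": SRC, "Key": key},
        )
        # ...and copy back, which fires the ObjectCreated trigger.
        s3.copy_object(
            Bucket=SRC, Key=key,
            CopySource={"Bucket": TMP, "Key": key},
        )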
In addition to Mausam Sharma's comment, you can run the copy between buckets using the AWS CLI:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --source-region SOURCE-REGION-NAME --region DESTINATION-REGION-NAME
from here:
https://medium.com/tensult/copy-s3-bucket-objects-across-aws-accounts-e46c15c4b9e1
You can simply copy back to the same bucket with the CLI, which will replace the original file with itself and run the Lambda as a result (the --metadata-directive REPLACE flag is needed because S3 rejects an in-place copy that changes nothing):
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --metadata-directive REPLACE
You can also use include/exclude glob patterns to run selectively against, say, a particular day or specific extensions:
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --metadata-directive REPLACE --exclude "*" --include "2020-01-15*"
It's worth noting that, like many of the other answers here, this will incur costs on S3 for reads/writes etc., so apply it cautiously to buckets containing lots of files.