AWS S3 deletion of files that haven't been accessed

I'm writing a service that takes screenshots of a lot of URLs and saves them in a public S3 bucket.
Due to storage costs, I'd like to periodically purge the aforementioned bucket and delete every screenshot that hasn't been accessed in the last X days.
By "accessed" I mean downloaded or acquired via a GET request.
I checked out the documentation and found a lot of ways to define an expiration policy for an S3 object, but couldn't find a way to "mark" a file as read once it's been accessed externally.
Is there a way to define the periodic purge without code (only AWS rules/services)? Does the API even allow that or do I need to start implementing external workarounds?

You can use Amazon S3 Storage Class Analysis:
By using Amazon S3 analytics storage class analysis you can analyze storage access patterns to help you decide when to transition the right data to the right storage class. This new Amazon S3 analytics feature observes data access patterns to help you determine when to transition less frequently accessed STANDARD storage to the STANDARD_IA (IA, for infrequent access) storage class.
After storage class analysis observes the infrequent access patterns of a filtered set of data over a period of time, you can use the analysis results to help you improve your lifecycle policies.
Even if you don't use it to change Storage Class, you can use it to discover which objects are not accessed frequently.
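If you want to enable this programmatically rather than in the console, a minimal boto3 sketch looks like the following (the bucket names, configuration Id and export prefix are placeholders; the DataExport block is optional and writes daily CSV results to a second bucket):

```python
import boto3

s3 = boto3.client("s3")

# Enable storage class analysis for the whole bucket and export daily CSV
# results so the access patterns can be inspected later.
s3.put_bucket_analytics_configuration(
    Bucket="my-screenshot-bucket",            # placeholder
    Id="whole-bucket-access-analysis",        # placeholder
    AnalyticsConfiguration={
        "Id": "whole-bucket-access-analysis",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::my-analytics-results",  # placeholder
                        "Prefix": "screenshot-analysis/",
                    }
                },
            }
        },
    },
)
```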

There is no such service provided by AWS. You will have to write your own solution.
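If you do go the roll-your-own route, one workable pattern is to enable S3 server access logging on the bucket and run a scheduled job that derives the last GET per key from the logs and deletes anything idle longer than X days. A rough boto3 sketch, assuming logging is already enabled and delivered to a separate bucket; all bucket names, prefixes and the threshold are placeholders, the log parsing is deliberately simplistic, and the log file's delivery time is used as a cheap proxy for the request timestamp:

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")

SCREENSHOT_BUCKET = "my-screenshot-bucket"   # placeholder
LOG_BUCKET = "my-access-log-bucket"          # placeholder
LOG_PREFIX = "screenshot-logs/"              # placeholder
MAX_IDLE_DAYS = 30                           # "X days"

def last_get_times():
    """Return {key: most recent GET time} derived from the access logs."""
    last_seen = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=LOG_PREFIX):
        for log_obj in page.get("Contents", []):
            body = s3.get_object(Bucket=LOG_BUCKET, Key=log_obj["Key"])["Body"]
            for line in body.read().decode("utf-8", "replace").splitlines():
                fields = line.split(" ")
                if "REST.GET.OBJECT" in fields:
                    key = fields[fields.index("REST.GET.OBJECT") + 1]
                    seen = last_seen.get(key)
                    if seen is None or log_obj["LastModified"] > seen:
                        last_seen[key] = log_obj["LastModified"]
    return last_seen

def purge_idle_objects():
    cutoff = datetime.now(timezone.utc) - timedelta(days=MAX_IDLE_DAYS)
    last_seen = last_get_times()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SCREENSHOT_BUCKET):
        for obj in page.get("Contents", []):
            # Objects that have never been downloaded fall back to their upload time.
            last_access = last_seen.get(obj["Key"], obj["LastModified"])
            if last_access < cutoff:
                s3.delete_object(Bucket=SCREENSHOT_BUCKET, Key=obj["Key"])
```

Server access logs are delivered on a best-effort basis, so treat the result as approximate rather than exact.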

Related

Will there be any impact in AWS Athena, if we change the S3 Storage class?

In our organization, we are facing cost issues due to overloaded S3 buckets. Too many junk files and archives are being stored, which is causing the problem.
I recently got approval to work on a Lifecycle policy in AWS S3. Before I start, I need to clarify that our Athena databases have their storage in one of these S3 buckets.
If we change the storage class, will that impact the Athena database queries?
That depends on the storage class.
If you archive data into Glacier, Athena won't be able to read it and will simply ignore it. Athena can still read the other storage classes, e.g. the infrequent-access ones, but they carry per-GB retrieval charges, so your costs will increase if the objects are read at least once per month.
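If you are not sure whether any of the data Athena reads has already been archived, one quick check is to list the table's S3 location and count the storage classes; a small boto3 sketch (the bucket name and prefix are placeholders):

```python
import boto3
from collections import Counter

s3 = boto3.client("s3")

# Count storage classes under the prefix an Athena table points at.
counts = Counter()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-org-bucket", Prefix="athena-data/"):
    for obj in page.get("Contents", []):
        counts[obj.get("StorageClass", "STANDARD")] += 1

# Anything showing up as GLACIER or DEEP_ARCHIVE would be skipped by Athena.
print(counts)
```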

How to limit the amount of data stored by user via S3 buckets on AWS?

I'm creating a platform whereby users upload data to us. We check the data to make sure it's safe and correctly formatted, and then store the data in buckets, tagging by user.
The size of the data upload is normally around 100MB. This is large enough to be concerning.
I'm worried about cases where certain users may try to store an unreasonable amount of data on the platform, i.e. they make thousands of transactions within a short period of time.
How do cloud service providers allow site admins to monitor the amount of data stored per user?
How is this actively monitored?
Any direction/insight appreciated. Thank you.
Amazon S3 does not have a mechanism for limiting data storage per "user".
Instead, your application will be responsible for managing storage and for defining the concept of an "application user".
This is best done by tracking data files in a database, including filename, owner, access permissions, metadata and lifecycle rules (eg "delete after 90 days"). Many applications allow files to be shared between users, such as a photo sharing website where you can grant view-access of your photos to another user. This is outside the scope of Amazon S3 as a storage service, and should be handled within your own application.
If such data is maintained in a database, it would be trivial to run a query to identify "data stored per user". I would recommend against storing such information as metadata against the individual objects in Amazon S3 because there is no easy way to query metadata across all objects (eg list all objects associated with a particular user).
Bottom line: Amazon S3 will store your data and keep it secure. However, it is the responsibility of your application to manage activities related to users.
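As a minimal illustration of the tracking approach described above, a single table written by your upload path is enough to answer "data stored per user" with one aggregate query. SQLite is used here purely for brevity; the schema and names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("files.db")

# One row per stored object, written by the application's upload path.
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        s3_key      TEXT PRIMARY KEY,
        owner_id    TEXT NOT NULL,
        size_bytes  INTEGER NOT NULL,
        uploaded_at TEXT NOT NULL
    )
""")

# "Data stored per user" is then a single aggregate query.
for owner_id, total_bytes in conn.execute(
        "SELECT owner_id, SUM(size_bytes) FROM files GROUP BY owner_id"):
    print(owner_id, total_bytes)
```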

How to change storage class In S3 the fastest way

I have around 7 TB of data in a folder in Amazon S3. I want to change the storage class from Standard to One Zone-IA. But when it's done via the UI it takes too long; it might even take a whole day. What's the fastest way to change the storage class?
You can create a Lifecycle Policy for an S3 Bucket.
This can automatically change the storage class for objects older than a given number of days.
So, this is the "fastest" way for you to request the change.
However, the Lifecycle policy might take up to 24-48 hours to complete, so it might not be the "fastest" to have all the objects transitioned.
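For reference, the same rule can be created programmatically; a minimal boto3 sketch that transitions objects under a prefix to ONEZONE_IA 30 days after creation (bucket name, prefix and day count are placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",                           # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "to-onezone-ia",
            "Status": "Enabled",
            "Filter": {"Prefix": "my-folder/"},   # placeholder prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "ONEZONE_IA"},
            ],
        }]
    },
)
```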
You can do it in different ways:
Via the console, as you have experienced
Via lifecycle management
Via the AWS CLI
Via an AWS SDK (if you know any of the supported programming languages)
You can also change the storage class of an object that is already stored in Amazon S3 to any other storage class by making a copy of the object using the PUT Object - Copy API.
You copy the object in the same bucket using the same key name and specify request headers as follows:
Set the x-amz-metadata-directive header to COPY.
Set the x-amz-storage-class to the storage class that you want to use.
In a versioning-enabled bucket, you cannot change the storage class of a specific version of an object. When you copy it, Amazon S3 gives it a new version ID.
Option 4 would be the fastest way in my case (as a developer): looping through all the objects and copying them with the correct storage class (see the sketch below).
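A rough boto3 sketch of that loop, copying each object onto itself with the new storage class (bucket and prefix are placeholders; copy_object handles objects up to 5 GB, so larger objects would need a multipart copy):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"        # placeholder
prefix = "my-folder/"       # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # Copy the object onto itself, changing only the storage class.
        s3.copy_object(
            Bucket=bucket,
            Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            StorageClass="ONEZONE_IA",
            MetadataDirective="COPY",   # keep the existing metadata
        )
```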
Hope it helps!

Using AWS Glacier as back-up

I have a website where I serve content that is stored on an AWS S3 bucket. As the amount of content grows, I have started thinking about back-up options. Using AWS Glacier came up as a natural route.
After reading up on it, I didn't understand whether it does what I intend to do with it. From what I have understood, using Glacier, you set lifecycle policies on objects stored on your S3 buckets. According to these policies, objects will be transferred to Glacier and deleted from your S3 bucket at a specific point in time after they have been uploaded to S3. At this point, the object's storage class changes to 'GLACIER'. Amazon explains that, once this is done, you can no longer access the objects through S3 but "their index entry will remain as is". Simultaneously, they say that retrieval of objects from Glacier takes 3-5 hours.
My question is: Does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first? Or does it mean that they will still be served from the S3 bucket as usual but that, in case something happens with the files on S3 I will just be able to retrieve them in 3-5 hours? Glacier would only be a viable back up solution for me if users of my website would still be able to load content on the website after the correspondent objects are transferred to Glacier. Also, is it possible to have objects transferred to Glacier without them being deleted from the S3 bucket?
Thank you
To answer your question: Does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first?
No — you won't be able to serve them on your website unless you first restore them from Glacier back to the STANDARD or STANDARD_IA class, which takes 3-5 hours. Glacier is generally used to archive cold data, such as old logs, that is only rarely accessed. So if you need real-time access to the objects, Glacier isn't a valid option for you.
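For completeness, if you ever do need an archived object back, the restore has to be requested explicitly and then waited on; a minimal boto3 sketch (bucket, key, retention days and retrieval tier are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to make a temporary copy of the archived object available for 7 days.
s3.restore_object(
    Bucket="my-content-bucket",        # placeholder
    Key="images/photo.jpg",            # placeholder
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},   # typically takes hours
    },
)
```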

can i meter or set a size limit to an s3 folder

I'd like to set up a separate s3 bucket folder for each of my mobile app users for them to store their files. However, I also want to set up size limits so that they don't use up too much storage. Additionally, if they do go over the limit I'd like to offer them increased space if they sign up for a premium service.
Is there a way I can set folder file size limits through s3 configuration or api? If not would I have to use the apis somehow to calculate folder size on every upload? I know that there is the devpay feature in Amazon but it might be a hassle for users to sign up with Amazon if they want to just use small amount of free space.
There does not appear to be a way to do this, probably at least in part because there is actually no such thing as "folders" in S3. There is only the appearance of folders.
Amazon S3 does not have concept of a folder, there are only buckets and objects. The Amazon S3 console supports the folder concept using the object key name prefixes.
— http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
All of the keys in an S3 bucket are actually in a flat namespace, with the / delimiter used as desired to conceptually divide objects into logical groupings that look like folders, but it's only a convenient illusion. It seems impossible that S3 would have a concept of the size of a folder, when it has no actual concept of "folders" at all.
If you don't maintain an authoritative database of what's been stored by clients (which suggests that all uploads should pass through an app server rather than going directly to S3, which is the only approach that makes sense to me at all) then your only alternative is to poll S3 to discover what's there. An imperfect shortcut would be for your application to read the S3 bucket logs to discover what had been uploaded, but that is only provided on a best-effort basis. It should be reliable but is not guaranteed to be perfect.
This service provides a best effort attempt to log all access of objects within a bucket. Please note that it is possible that the actual usage report at the end of a month will slightly vary.
Your other option is to develop your own service that sits between users and Amazon S3, that monitors all requests to your buckets/objects.
— http://aws.amazon.com/articles/1109#13
Again, having your app server mediate all requests seems to be the logical approach, and would also allow you to detect immediately (as opposed to "discover later") that a user had exceeded a threshold.
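If you do end up polling S3 as mentioned above, summing the sizes under a user's key prefix is a short loop; a minimal boto3 sketch (bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def prefix_size_bytes(bucket, prefix):
    """Sum the sizes of all objects under a key prefix (a 'folder')."""
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

print(prefix_size_bytes("my-app-bucket", "users/alice/"))   # placeholders
```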
I would maintain a separate database in the cloud to hold each user's total storage usage. It's easy to keep that count up to date via S3 Event Notifications (object-created and object-removed events), which can trigger a Lambda function that in turn writes to the DB; a sketch follows below.
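A minimal sketch of that Lambda, assuming object-created notifications on the bucket are wired to it and that keys look like users/<user_id>/... (the DynamoDB table and attribute names are hypothetical; object-removed events carry no size, so decrementing on delete would need a database lookup and is omitted here):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
usage_table = dynamodb.Table("UserStorageUsage")    # hypothetical table

def handler(event, context):
    for record in event["Records"]:
        if not record["eventName"].startswith("ObjectCreated"):
            continue
        key = record["s3"]["object"]["key"]         # e.g. "users/<user_id>/file.png"
        size = record["s3"]["object"]["size"]
        user_id = key.split("/")[1]
        # Atomically bump the user's running total.
        usage_table.update_item(
            Key={"user_id": user_id},
            UpdateExpression="ADD bytes_used :d",
            ExpressionAttributeValues={":d": size},
        )
```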